Just choose File > New File > R Script and a script window will open in the upper left of the RStudio window.
u <- "http://blue.for.msu.edu/FOR875/data/WorldBank.csv"
WorldBank <- read.csv(u, header = TRUE, stringsAsFactors = FALSE)
names(WorldBank)
## [1] "iso2c" "country"
## [3] "year" "fertility.rate"
## [5] "life.expectancy" "population"
## [7] "GDP.per.capita.Current.USD" "X15.to.25.yr.female.literacy"
## [9] "iso3c" "region"
## [11] "capital" "longitude"
## [13] "latitude" "income"
## [15] "lending"
We will create a scatter plot of fertility rate versus life expectancy for countries in the year 1960.
fertility <- WorldBank$fertility.rate[WorldBank$year == 1960]
lifeexp <- WorldBank$life.expectancy[WorldBank$year == 1960]
fertility[1:10]
## [1] NA 6.928 7.671 4.425 6.186 4.550 7.316 3.109 NA 2.690
plot(lifeexp, fertility)
pop <- WorldBank$population[WorldBank$year == 1960]
region <- WorldBank$region[WorldBank$year == 1960]
First we create the axes and other elements of the plot, but not the points themselves; the argument type = "n" tells R to do this.
plot(lifeexp, fertility, type="n")
symbols(lifeexp, fertility, circles=sqrt(pop/pi), inches=0.35, bg=match(region, unique(region)))
R Markdown provides a way to include R code that reads in data, creates graphics, or performs analyses, all in a single document which is processed to create a research paper, homework assignment, or other written product.
Markdown formatting examples: *italic*, **bold**, ~~strikethrough~~, and > for a block quote. A code chunk:
x <- 1:10
y <- 10:1
mean(x)
## [1] 5.5
sd(y)
## [1] 3.02765
Inline code is evaluated and its result inserted into the text (here, 10). Inline code that is not executed is shown verbatim: 5+5
For an unordered list, either an asterisk, a plus sign, or a minus sign may precede list items.
For an ordered list use a numeral followed by a period and a space (1. or 2. or 3. or …). The first list item will be labeled with the number or letter that you specify, but subsequent list items will be numbered sequentially.
An unordered list:
An ordered list:
Use # to indicate headers: # gives a first level header, ## a second level header, and so on. Inline formatting works inside headers, e.g., # A first *level* ~~header~~.
Text subscripts and superscripts: x~2~ + y^2^, 10^3^ = 1000. Mathematics examples: \(x_a\) and \(x^a\)
echo=FALSE specifies that the R code should not be printed (but any output of the R code should be printed) in the resulting document.
include=FALSE specifies that neither the R code nor the output should be printed. However, the objects created by the code chunk will be available for use in later code chunks.
eval=FALSE specifies that the R code should not be evaluated. The code will be printed unless, for example, echo=FALSE is also given as an option. error=FALSE and warning=FALSE specify that, respectively, error messages and warning messages generated by the R code should not be printed.
The comment option allows a specified character string to be prepended to each line of results. By default this is set to comment = "##", which explains the two hash marks preceding results in Figure 3.3 for example. Setting comment = NA presents output without any character string prepended. That is done in most code chunks in this book.
prompt=TRUE specifies that the R prompt > will be prepended to each line of R code shown in the document. prompt=FALSE specifies that command prompts should not be included.
fig.height and fig.width specify the height and width of figures generated by R code. These are specified in inches, so for example fig.height=4 specifies a four inch high figure.
code: set to R code; knitr will replace the code in the chunk with the code given in this option.
engine (default 'R'): knitr will evaluate the chunk in the named language, e.g., engine = 'python'.
collapse: if TRUE, knitr will collapse all the source and output blocks created by the chunk into a single block.
results (default 'markup'): if 'hide', knitr will not display the code's results in the final document; if 'hold', knitr will delay displaying all output pieces until the end of the chunk; if 'asis', knitr will pass through results without reformatting them (useful if results return raw HTML, etc.).
error: if FALSE, knitr will not display any error messages generated by the code.
message: if FALSE, knitr will not display any messages generated by the code.
warning: if FALSE, knitr will not display any warning messages generated by the code.
strip.white: if TRUE, knitr will remove white space that appears at the beginning or end of a code chunk.
autodep: if TRUE, knitr will attempt to figure out dependencies between chunks automatically by analyzing object names.
cache: if TRUE, knitr will cache the results to reuse in future knits; knitr will reuse the results until the code chunk is altered.
cache.comments (default NULL): if FALSE, knitr will not rerun the chunk if only a code comment has changed.
Think of a vector as a structure to represent one variable in a data set. For example, a vector might hold the weights, in pounds, of 7 people in a data set. Vectors in R can only contain elements of one type.
weight <- c(123, 157, 205, 199, 223, 140, 105)
weight
## [1] 123 157 205 199 223 140 105
gender <- c("female", "female", "male", "female", "male", "male", "female")
gender
## [1] "female" "female" "male" "female" "male" "male" "female"
Notice that elements of a vector are separated by commas when using the c() function to create a vector.
The c() function also can be used to add to an existing vector, appending new elements at the end.
weight <- c(weight, 194)
weight
## [1] 123 157 205 199 223 140 105 194
gender <- c(gender, "male")
gender
## [1] "female" "female" "male" "female" "male" "male" "female" "male"
typeof(weight)
## [1] "double"
typeof(gender)
## [1] "character"
bp <- c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)
bp
## [1] FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
typeof(bp)
## [1] "logical"
as.character(bp)
## [1] "FALSE" "TRUE" "FALSE" "FALSE" "TRUE" "FALSE" "TRUE" "FALSE"
as.numeric(bp)
## [1] 0 1 0 0 1 0 1 0
is.character(bp)
## [1] FALSE
char_bp = as.character(bp)
is.character(char_bp)
## [1] TRUE
weight.int <- as.integer(weight)
weight.int
## [1] 123 157 205 199 223 140 105 194
weight.char <- as.character(weight)
weight.char
## [1] "123" "157" "205" "199" "223" "140" "105" "194"
bp.double <- as.double(bp)
bp.double
## [1] 0 1 0 0 1 0 1 0
Values are converted to the simplest type that can represent all the information. The coercion hierarchy is: character > complex > double > integer > logical.
c(4 + 3i, TRUE, "red", 6) # will become character
## [1] "4+3i" "TRUE" "red" "6"
c(4 + 3i, TRUE, 4/5, 6) # will become complex
## [1] 4.0+3i 1.0+0i 0.8+0i 6.0+0i
c(4 + 3, TRUE, 4/5, 6.0) # will become double
## [1] 7.0 1.0 0.8 6.0
c(4 + 3, TRUE, 6) # will become double (numeric literals are doubles; TRUE is coerced to 1)
## [1] 7 1 6
yy <- c(1, 2, 3, "dog")
yy
## [1] "1" "2" "3" "dog"
weight[5]
## [1] 223
weight[1:3]
## [1] 123 157 205
length(weight)
## [1] 8
weight[length(weight)]
## [1] 194
weight[3] <- 200000
weight
## [1] 123 157 200000 199 223 140 105 194
Negative numbers in the square brackets tell R to omit the corresponding value.
weight[-3]
## [1] 123 157 199 223 140 105 194
lessWeight <- weight[-c(1, 3, 5)]
lessWeight
## [1] 157 199 140 105 194
weight
## [1] 123 157 200000 199 223 140 105 194
Categorical variables such as gender can be represented as character vectors. In many cases this simple representation is sufficient. Factors in R provide a more sophisticated way to represent categorical variables.
age <- c("middle age", "senior", "middle age", "senior", "senior", "senior", "senior", "middle age")
age
## [1] "middle age" "senior" "middle age" "senior" "senior"
## [6] "senior" "senior" "middle age"
income <- c("lower", "lower", "upper", "middle", "upper", "lower", "lower", "middle")
income
## [1] "lower" "lower" "upper" "middle" "upper" "lower" "lower" "middle"
age <- factor(age, levels=c("youth", "young adult", "middle age", "senior"))
age
## [1] middle age senior middle age senior senior senior
## [7] senior middle age
## Levels: youth young adult middle age senior
income <- factor(income, levels=c("lower", "middle", "upper"),ordered = TRUE)
income
## [1] lower lower upper middle upper lower lower middle
## Levels: lower < middle < upper
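Because income was created with ordered = TRUE, comparison operators respect the level ordering. A minimal sketch, re-creating a small income factor so the example is self-contained:

```r
income <- factor(c("lower", "lower", "upper", "middle"),
                 levels = c("lower", "middle", "upper"), ordered = TRUE)
# Comparisons use the level order lower < middle < upper
income[1] < income[3]
## [1] TRUE
max(income)
## [1] upper
## Levels: lower < middle < upper
```

Functions such as max() and min() also work on ordered factors, whereas they fail for unordered factors.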
x = c(2,3,4)
y = c(5,6,7)
y + x
## [1] 7 9 11
x_small = c(1,2,3)
y_big = c(1,2,3,4,5,6)
x_small + y_big
## [1] 2 4 6 5 7 9
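Recycling is silent when the longer length is a multiple of the shorter, as above. When it is not, R still recycles the shorter vector but issues a warning, which usually signals a mistake:

```r
x <- c(1, 2, 3)
y <- c(1, 2, 3, 4, 5)
# x is recycled as 1, 2, 3, 1, 2; R warns because 5 is not a multiple of 3
x + y
## [1] 2 4 6 5 7
## Warning message:
## In x + y : longer object length is not a multiple of shorter object length
```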
x <- weight
x <- c(x, NA, NA)
sum(is.na(x))
## [1] 2
x[100]
## [1] NA
x = 1:10
mean(na.rm=FALSE, x, trim=0)
## [1] 5.5
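The call above works because R matches named arguments first (here na.rm and trim), then fills the remaining formal arguments (here x) positionally. A small sketch of the two matching styles:

```r
x <- c(1, 2, 3, 4, 100)
# Positional matching: trim is mean()'s second formal argument,
# so this trims 20% of the values from each end before averaging
mean(x, 0.2)
## [1] 3
# Named arguments can be given in any order
mean(trim = 0.2, x = x)
## [1] 3
```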
library(ggplot2) # the mpg data frame is included with ggplot2
y <- mpg[, 2]
dim(mpg)
## [1] 234 11
z <- rep(0, length = dim(mpg)[2])
for (i in 1:dim(mpg)[2]){
z[i] <- sum(mpg[, i] < 5)}
z
## [1] 0 6 196 0 81 0 103 0 0 0 5
a <- c(3, 6, 9, 12)
b <- matrix(c(1:4), nrow = 2, ncol = 2)
c <- c("good", "bad", "indifferent")
my.list <- list(a, b, c)
my.list
## [[1]]
## [1] 3 6 9 12
##
## [[2]]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## [[3]]
## [1] "good" "bad" "indifferent"
my.list[[1]][3]
## [1] 9
my.list[[2]][2, ]
## [1] 2 4
missingCharacter <- c("dog", "cat", NA, "pig", NA, "horse")
missingCharacter
## [1] "dog" "cat" NA "pig" NA "horse"
is.na(missingCharacter)
## [1] FALSE FALSE TRUE FALSE TRUE FALSE
Many functions accept an na.rm argument that tells R to remove the missing value(s) and then perform the computation.
mean(c(1, 2, 3, NA, 5), na.rm = TRUE)
## [1] 2.75
1. NaN represents the result of a calculation whose result is undefined, such as dividing zero by zero.
2. Inf and -Inf represent infinity and negative infinity (and numbers which are too large in magnitude to be represented as floating point numbers).
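A few quick illustrations of these special values, along with the predicate functions is.nan() and is.infinite() for detecting them:

```r
0 / 0          # undefined
## [1] NaN
1 / 0
## [1] Inf
-1 / 0
## [1] -Inf
is.nan(0 / 0)
## [1] TRUE
is.infinite(c(1 / 0, -1 / 0, 5))
## [1]  TRUE  TRUE FALSE
```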
A data frame is rectangular in form, with variables as columns and cases as rows.
gender <- c("female", "female", "male", "female", "male", "male", "female")
bp <- c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
weight <- c(123, 157, 205, 199, 223, 140, 105)
healthData <- data.frame(Weight = weight, Gender=gender, bp.meds = bp, stringsAsFactors=FALSE)
healthData
## Weight Gender bp.meds
## 1 123 female FALSE
## 2 157 female TRUE
## 3 205 male FALSE
## 4 199 female FALSE
## 5 223 male TRUE
## 6 140 male FALSE
## 7 105 female TRUE
names(healthData)
## [1] "Weight" "Gender" "bp.meds"
colnames(healthData)
## [1] "Weight" "Gender" "bp.meds"
names(healthData) <- c("Wt", "Gdr", "bp")
healthData
## Wt Gdr bp
## 1 123 female FALSE
## 2 157 female TRUE
## 3 205 male FALSE
## 4 199 female FALSE
## 5 223 male TRUE
## 6 140 male FALSE
## 7 105 female TRUE
rownames(healthData)
## [1] "1" "2" "3" "4" "5" "6" "7"
names(healthData) <- c("Weight", "Gender", "bp.meds")
healthData
## Weight Gender bp.meds
## 1 123 female FALSE
## 2 157 female TRUE
## 3 205 male FALSE
## 4 199 female FALSE
## 5 223 male TRUE
## 6 140 male FALSE
## 7 105 female TRUE
Data frames are two-dimensional, so to access a specific element (or elements) we need to specify both the row and column.
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars[3, 4]
## [1] 93
row.names(mtcars)
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
mtcars[1:3, 2:3]
## cyl disp
## Mazda RX4 6 160
## Mazda RX4 Wag 6 160
## Datsun 710 4 108
mtcars["cyl"]
## cyl
## Mazda RX4 6
## Mazda RX4 Wag 6
## Datsun 710 4
## Hornet 4 Drive 6
## Hornet Sportabout 8
## Valiant 6
## Duster 360 8
## Merc 240D 4
## Merc 230 4
## Merc 280 6
## Merc 280C 6
## Merc 450SE 8
## Merc 450SL 8
## Merc 450SLC 8
## Cadillac Fleetwood 8
## Lincoln Continental 8
## Chrysler Imperial 8
## Fiat 128 4
## Honda Civic 4
## Toyota Corolla 4
## Toyota Corona 4
## Dodge Challenger 8
## AMC Javelin 8
## Camaro Z28 8
## Pontiac Firebird 8
## Fiat X1-9 4
## Porsche 914-2 4
## Lotus Europa 4
## Ford Pantera L 8
## Ferrari Dino 6
## Maserati Bora 8
## Volvo 142E 4
Note that mtcars[,1] returns ALL elements in the first column.
mtcars[, 1]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
For a data frame there is another way to access specific columns, using the $ notation.
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
Technically a list is a vector, but one in which elements can be of different types.
Key properties: my_list[[2]] returns the second element itself, while my_list[2] returns a list containing the second element. The lm function returns a list.
mpgHpLinMod <- lm(mpg ~ hp, data = mtcars)
mode(mpgHpLinMod)
## [1] "list"
names(mpgHpLinMod)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
mpgHpLinMod
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Coefficients:
## (Intercept) hp
## 30.09886 -0.06823
mpgHpLinMod$coefficients
## (Intercept) hp
## 30.09886054 -0.06822828
The list function is used to create lists.
temporaryList <- list(first=weight, second=healthData, pickle=list(a = 1:10, b=healthData))
temporaryList$first
## [1] 123 157 205 199 223 140 105
temporaryList[[1]]
## [1] 123 157 205 199 223 140 105
temporaryList[[1]][1]
## [1] 123
temporaryList$second
## Weight Gender bp.meds
## 1 123 female FALSE
## 2 157 female TRUE
## 3 205 male FALSE
## 4 199 female FALSE
## 5 223 male TRUE
## 6 140 male FALSE
## 7 105 female TRUE
temporaryList[c(1, 2)] # returns a list with 1st and 2nd elements
## $first
## [1] 123 157 205 199 223 140 105
##
## $second
## Weight Gender bp.meds
## 1 123 female FALSE
## 2 157 female TRUE
## 3 205 male FALSE
## 4 199 female FALSE
## 5 223 male TRUE
## 6 140 male FALSE
## 7 105 female TRUE
temporaryList[[c(1, 2)]]
## [1] 157
temporaryList[2]
## $second
## Weight Gender bp.meds
## 1 123 female FALSE
## 2 157 female TRUE
## 3 205 male FALSE
## 4 199 female FALSE
## 5 223 male TRUE
## 6 140 male FALSE
## 7 105 female TRUE
my_list = list(1, 2, 3, c(3, 5, 6), c("a", "b", "c"))
my_list[1]
## [[1]]
## [1] 1
my_list[[4]][1]
## [1] 3
my_list[[c(4, 2)]]
## [1] 5
weight
## [1] 123 157 205 199 223 140 105
gender
## [1] "female" "female" "male" "female" "male" "male" "female"
gender[weight > 200]
## [1] "male" "male"
lightweight <- weight[weight < 200]
lightweight
## [1] 123 157 199 140 105
x <- 1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
x[x < 5] <- 0
x
## [1] 0 0 0 0 5 6 7 8 9 10
Example:
x = c(-3:3)
x
## [1] -3 -2 -1 0 1 2 3
z <- c(T, F, F, T, F, F, F)
x[z]
## [1] -3 0
We can use <, >, ==, >=, <=, and != to create conditions, and combine conditions with & and |.
healthData
## Weight Gender bp.meds
## 1 123 female FALSE
## 2 157 female TRUE
## 3 205 male FALSE
## 4 199 female FALSE
## 5 223 male TRUE
## 6 140 male FALSE
## 7 105 female TRUE
healthData$Weight[healthData$Gender == "male"]
## [1] 205 223 140
healthData[healthData$Weight > 190, 2:3]
## Gender bp.meds
## 3 male FALSE
## 4 female FALSE
## 5 male TRUE
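Base R's subset() function offers a readable alternative for this kind of row-and-column selection. A sketch, re-creating healthData so the example is self-contained:

```r
healthData <- data.frame(Weight = c(123, 157, 205, 199, 223, 140, 105),
                         Gender = c("female", "female", "male", "female",
                                    "male", "male", "female"),
                         bp.meds = c(FALSE, TRUE, FALSE, FALSE, TRUE,
                                     FALSE, TRUE),
                         stringsAsFactors = FALSE)
# Rows where Weight > 190, keeping only the Gender and bp.meds columns
subset(healthData, Weight > 190, select = c(Gender, bp.meds))
##   Gender bp.meds
## 3   male   FALSE
## 4 female   FALSE
## 5   male    TRUE
```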
temporaryDataFrame <- data.frame(V1 = c(1, 2, 3, 4, NA), V2 = c(NA, 1, 4, 5, NA), V3 = c(1, 2, 3, 5, 7))
temporaryDataFrame
## V1 V2 V3
## 1 1 NA 1
## 2 2 1 2
## 3 3 4 3
## 4 4 5 5
## 5 NA NA 7
is.na(temporaryDataFrame)
## V1 V2 V3
## [1,] FALSE TRUE FALSE
## [2,] FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE
## [5,] TRUE TRUE FALSE
rowSums(is.na(temporaryDataFrame))
## [1] 1 0 0 0 2
colSums(is.na(temporaryDataFrame))
## V1 V2 V3
## 1 2 0
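To drop every row containing at least one missing value, base R provides complete.cases(), which returns TRUE for rows with no NA values:

```r
temporaryDataFrame <- data.frame(V1 = c(1, 2, 3, 4, NA),
                                 V2 = c(NA, 1, 4, 5, NA),
                                 V3 = c(1, 2, 3, 5, 7))
complete.cases(temporaryDataFrame)
## [1] FALSE  TRUE  TRUE  TRUE FALSE
# Keep only the fully observed rows
temporaryDataFrame[complete.cases(temporaryDataFrame), ]
##   V1 V2 V3
## 2  2  1  2
## 3  3  4  3
## 4  4  5  5
```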
The seq() function generates either a sequence of pre-specified length or a sequence with pre-specified increments.
seq(from = 0, to = 1, length = 11)
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(from = 1, to = 5, by = 1/3)
## [1] 1.000000 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000
## [8] 3.333333 3.666667 4.000000 4.333333 4.666667 5.000000
The rep() function replicates the values in a given vector.
rep(c(1, 2, 4), length = 9)
## [1] 1 2 4 1 2 4 1 2 4
rep(c(1, 2, 4), times = 3)
## [1] 1 2 4 1 2 4 1 2 4
rep(c("a", "b", "c"), times = c(3, 2, 7))
## [1] "a" "a" "a" "b" "b" "c" "c" "c" "c" "c" "c" "c"
Learning about R’s programming capabilities is an important step in gaining facility with functions.
u.corn <- "http://blue.for.msu.edu/FOR875/data/corn.csv"
corn <- read.csv(u.corn, header = TRUE)
corn
## regular kiln_dried
## 1 1903 2009
## 2 1935 1915
## 3 1910 2011
## 4 2496 2463
## 5 2108 2180
## 6 1961 1925
## 7 2060 2122
## 8 1444 1482
## 9 1612 1542
## 10 1316 1443
## 11 1511 1535
paired_t <- function(x1, x2) {
n <- length(x1)
dbar <- mean(x1 - x2)
s_d <- sd(x1 - x2)
tstat <- dbar/(s_d/sqrt(n))
pval <- 2 * (1 - pt(abs(tstat), n - 1))
margin <- qt(0.975, n - 1) * s_d/sqrt(n)
lcl <- dbar - margin
ucl <- dbar + margin
return(list(tstat = tstat, pval = pval, lcl = lcl, ucl = ucl))
}
paired_t(x1 = corn$kiln_dried, x2 = corn$regular)
## $tstat
## [1] 1.690476
##
## $pval
## [1] 0.1218166
##
## $lcl
## [1] -10.7271
##
## $ucl
## [1] 78.18164
A function can be written in any text editor, saved as a plain text file (possibly with a .r extension), and then read into R using the source() function.
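A minimal sketch of this workflow, writing a hypothetical one-line function definition to a temporary file and reading it back in with source():

```r
# Write a small function definition to a temporary .R file
f_file <- tempfile(fileext = ".R")
writeLines("square <- function(x) x^2", f_file)
# source() runs the file, defining square() in the current session
source(f_file)
square(4)
## [1] 16
```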
The condition given to an if statement is evaluated. If it returns TRUE the ensuing code is executed.
c(FALSE, TRUE, FALSE) || c(TRUE, FALSE, FALSE)
## [1] TRUE
c(FALSE, TRUE, FALSE) | c(TRUE, FALSE, FALSE)
## [1] TRUE TRUE FALSE
c(FALSE, TRUE, FALSE) && c(TRUE, TRUE, FALSE)
## [1] FALSE
c(FALSE, TRUE, FALSE) & c(TRUE, TRUE, FALSE)
## [1] FALSE TRUE FALSE
Sign <- function(x) {
if (x < 0) {
print("the number is negative")
}else if (x > 0) {
print("the number is positive")
}else {
print("the number is zero")
}
}
Sign(3)
## [1] "the number is positive"
Sign(-3)
## [1] "the number is negative"
Sign(0)
## [1] "the number is zero"
R does not perform exact arithmetic with floating point numbers:
2^-30
## [1] 9.313226e-10
2^-30 + (2^30 - 2^30)
## [1] 9.313226e-10
(2^-30 + 2^30) - 2^30
## [1] 0
1.5 - 1.4 == 0.1
## [1] FALSE
all.equal((1.5 - 1.4), 0.1)
## [1] TRUE
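Note that all.equal() returns TRUE on (near) equality but a descriptive character string otherwise, so when testing equality in a condition, wrap it in isTRUE():

```r
x <- 1.5 - 1.4
# all.equal() returns a character string (not FALSE) when values differ,
# so wrap it in isTRUE() when using it as a condition
if (isTRUE(all.equal(x, 0.1))) "equal" else "not equal"
## [1] "equal"
```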
Loops are an important component of any programming language, including R. Vectorized calculations and functions such as apply() make loops a bit less central to R than to many other languages.
A repeat loop just repeats a given expression over and over again until a break statement is encountered.
k <- 1
repeat {
if (1 == 1 + 1/2^k){
break
}else {
k <- k + 1
}
}
k
## [1] 53
A while loop has the form while (condition) expression. As long as the condition is TRUE the expression is evaluated. Once the condition is FALSE, control is transferred outside the loop.
k <- 1
while (1 != 1 + 1/2^k) {
k <- k + 1
}
k
## [1] 53
A for loop has the form for (variable in vector) expression. The for loop sets the variable equal to each element of the vector in succession, and evaluates the expression each time.
x <- 1:10
S <- 0
for (i in 1:length(x)){
S <- S + x[i]
}
S
## [1] 55
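One caution about the 1:length(x) idiom: when x is empty, 1:0 counts down rather than producing an empty sequence. The safer idiom is seq_along(x):

```r
x <- numeric(0)   # an empty vector
1:length(x)       # counts down -- a loop over this runs twice!
## [1] 1 0
seq_along(x)      # empty -- a loop over this never runs
## integer(0)
S <- 0
for (i in seq_along(x)) {
  S <- S + x[i]
}
S
## [1] 0
```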
The general ggplot template: ggplot(data = <dataframe>, mapping = aes(<mappings>)) + <geom_function>(aes(<mappings>)) + a faceting function such as facet_wrap(~ var) or facet_grid(var1 ~ var2) + labs(x, y, title, caption, subtitle, color, size).
data("mtcars")
ggplot(data = mtcars,
mapping = aes(x=hp, y=mpg)) +
geom_point(aes(color = factor(cyl),
shape = factor(am)),
size=5)
data("mtcars")
ggplot(data = mtcars,
mapping = aes(x=hp, y=mpg)) +
geom_point(aes(color = factor(cyl),
shape = factor(am)),
size=5) +
facet_wrap(~factor(cyl))
Scatter plots are a workhorse of data visualization and provide a good entry point to the ggplot2 system.
data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
library(ggplot2)
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point()
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(size = 4, aes(color=Species, shape=Species))
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(size=3, aes(color=Species)) +
stat_smooth(method = lm, se=FALSE)
For the iris data, it probably makes more sense to fit separate lines by species. This can be specified using the aes() function inside stat_smooth().
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(size=3, aes(color=Species)) +
stat_smooth(method = lm, se=FALSE, aes(color=Species))
Another common use of line segments in a graphic is to connect the points in order, accomplished via the geom_line() function.
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(size = 4, aes(color=Species, shape = Species)) +
geom_line()
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(size = 4, aes(color=Species)) +
geom_line(aes(color=Species))
u.crime <- "http://blue.for.msu.edu/FOR875/data/crimeRatesByState2005.csv"
crime <- read.csv(u.crime, header=TRUE)
str(crime)
## 'data.frame': 50 obs. of 9 variables:
## $ state : Factor w/ 50 levels "Alabama ","Alaska ",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ murder : num 8.2 4.8 7.5 6.7 6.9 3.7 2.9 4.4 5 6.2 ...
## $ Forcible_rate : num 34.3 81.1 33.8 42.9 26 43.4 20 44.7 37.1 23.6 ...
## $ Robbery : num 141.4 80.9 144.4 91.1 176.1 ...
## $ aggravated_assult : num 248 465 327 387 317 ...
## $ burglary : num 954 622 948 1085 693 ...
## $ larceny_theft : num 2650 2599 2965 2711 1916 ...
## $ motor_vehicle_theft: num 288 391 924 262 713 ...
## $ population : int 4627851 686293 6500180 2855390 36756666 4861515 3501252 873092 18328340 9685744 ...
Here we use labs() to change the x and y axis labels and other descriptive text.
ggplot(data = crime, aes(x = burglary, y = motor_vehicle_theft)) +
geom_point() +
labs(x = "Burglaries per 100,000 population",
y = "Motor vehicle theft per 100,000 population",
title = "Burglaries vs motor vehicle theft for US states",
subtitle = "2005 crime rates and 2008 population",
caption = "Data from Nathan Yau http://flowingdata.com"
)
ggplot also provides default axis extents (i.e., limits) and other axis features. These, and other axis features such as tick marks, labels, and transformations, can be changed using the scale functions.
ggplot(data = crime, aes(x = burglary, y = motor_vehicle_theft)) +
geom_point() +
scale_x_continuous(name="Burglaries per 100,000 population",
limits=c(0,max(crime$burglary))) +
scale_y_continuous(name="Motor vehicle theft per 100,000 population",
limits = c(0, max(crime$motor_vehicle_theft)))
ggplot(data = crime, aes(x = burglary, y = motor_vehicle_theft,
size=population/100000)) +
geom_point(color = "blue") +
geom_label(aes(label = state), alpha = 0.5) +
scale_x_continuous(name="Burglaries per 100,000 population",
limits=c(0,max(crime$burglary))) +
scale_y_continuous(name="Motor vehicle theft per 100,000 population",
limits = c(0, max(crime$motor_vehicle_theft))) +
labs(size="Population\n(100,000)")
library(ggrepel)
ggplot(data = crime, aes(x = burglary, y = motor_vehicle_theft,
size=population/100000)) +
geom_point(color = "blue") +
scale_x_continuous(name="Burglaries per 100,000 population",
limits=c(0,max(crime$burglary))) +
scale_y_continuous(name="Motor vehicle theft per 100,000 population",
limits = c(0, max(crime$motor_vehicle_theft))) +
labs(size="Population\n(100,000)") +
ggrepel::geom_label_repel(aes(label = state), alpha = 0.5)
Other useful geoms include geom_raster (a heatmap over an (x, y) grid) and geom_density (a density plot).
Simon Newcomb conducted several experiments to estimate the speed of light by measuring the time it took for light to travel from his laboratory to a mirror at the base of the Washington Monument, and then back to his lab.
u.newcomb <- "http://blue.for.msu.edu/FOR875/data/Newcomb.csv"
Newcomb <- read.csv(u.newcomb, header = TRUE)
head(Newcomb)
## Time
## 1 28
## 2 26
## 3 33
## 4 24
## 5 34
## 6 -44
ggplot(Newcomb, aes(x = Time)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(Newcomb, aes(x = Time)) +
geom_histogram(binwidth = 5, color = "black", fill = "blue" )
library(gapminder)
ggplot(data = subset(gapminder, year == 2002),
aes(x = continent, y = gdpPercap)) +
geom_boxplot(color = "black", fill = "lightblue")
ggplot(data = subset(gapminder, year == 2002),
aes(x = continent, y = gdpPercap)) +
geom_boxplot(color = "red", fill = "lightblue") +
scale_x_discrete(name = "Continent") +
scale_y_continuous(name = "Per Capita GDP") + coord_flip()
As part of a study, elementary school students were asked which was more important to them: good grades, popularity, or athletic ability. Here is a brief look at the data.
u.goals <- "http://blue.for.msu.edu/FOR875/data/StudentGoals.csv"
StudentGoals <- read.csv(u.goals, header = TRUE)
head(StudentGoals)
## Gender Grade Age Race Type School Goals Grades Sports Looks Money
## 1 boy 5 11 White Rural Elm Sports 1 2 4 3
## 2 boy 5 10 White Rural Elm Popular 2 1 4 3
## 3 girl 5 11 White Rural Elm Popular 4 3 1 2
## 4 girl 5 11 White Rural Elm Popular 2 3 4 1
## 5 girl 5 10 White Rural Elm Popular 4 2 1 3
## 6 girl 5 11 White Rural Elm Popular 4 2 1 3
First a simple bar graph of the most important goal chosen is drawn, followed by a stacked bar graph which also includes the student’s gender, followed by a side by side bar graph which includes the student’s gender.
ggplot(StudentGoals, aes(x = Goals)) + geom_bar()
ggplot(StudentGoals, aes(x = Goals, fill = Gender)) + geom_bar()
ggplot(StudentGoals, aes(x = Goals, fill = Gender)) +
geom_bar(position = "dodge")
In this example R counted the number of students who had each goal and used these counts as the height of the bars. Sometimes the data contain the bar heights as a variable. For example, we create a bar graph of India's per capita GDP with separate bars for each year in the data.
ggplot(subset(gapminder, country == "India"), aes(x = year, y = gdpPercap)) +
geom_bar(stat = "identity", color = "black", fill = "steelblue2") +
ggtitle("India's per-capita GDP")
x <- seq(-pi, pi, len = 1000)
sin.data <- data.frame(x = x, y = sin(x))
ggplot(data = sin.data, aes(x = x, y = y)) + geom_line() +
scale_y_continuous(name = "sin(x)")
Default themes include: theme_bw(), theme_classic(), theme_dark(), theme_gray(), theme_light(), theme_linedraw(), theme_minimal(), and theme_void().
ggplot(data = sin.data, aes(x = x, y = y)) + geom_line() +
scale_y_continuous(name = "sin(x)") +
theme_classic()
The ggthemes add-on package (https://github.com/jrnold/ggthemes) by Jeffrey Arnold provides a large selection of themes beyond the eight themes that come with ggplot2.
The ggsave() function will save your most recent ggplot() to a variety of vector (e.g., "eps", "ps", "pdf", "svg") or raster (e.g., "jpeg", "tiff", "png", "bmp", "wmf") formats. The following call to ggsave() saves the sin.data plot to a PDF file called "sin-plot.pdf".
ggsave(filename = "sin-plot.pdf", device = "pdf")
This involves bringing data into R, exporting data from R in a form that is readable by other software, cleaning and reshaping data, and other data manipulation tasks.
Data come in a dizzying variety of forms. It might be in a proprietary format such as an .xlsx Excel file, a .sav SPSS file, or a .mtw Minitab file.
The read.table() function is used to read these data into an R data frame.
u.bb <- "http://blue.for.msu.edu/FOR875/data/BrainAndBody.csv"
BrainBody <- read.table(file = u.bb, header = TRUE, sep = ",",
stringsAsFactors = FALSE)
head(BrainBody)
## body brain name
## 1 1.35 8.1 Mountain beaver
## 2 465.00 423.0 Cow
## 3 36.33 119.5 Grey wolf
## 4 27.66 115.0 Goat
## 5 1.04 5.5 Guinea pig
## 6 11700.00 50.0 Dipliodocus
The arguments used in this call to read.table() include:
file = u.bb, which tells R the location of the file; header = TRUE, which tells R that the first line of the file gives the names of the variables; sep = ",", which tells R that a comma separates the fields in the file; and stringsAsFactors = FALSE, which tells R not to convert character vectors to factors. The function read.csv() is the same as read.table() except that the default separator is a comma, whereas the default separator for read.table() is whitespace.
The file BrainAndBody.tsv contains the same data, except a tab is used in place of a comma to separate fields.
u.bb <- "http://blue.for.msu.edu/FOR875/data/BrainAndBody.tsv"
BrainBody <- read.table(file = u.bb, header = TRUE, sep = "\t",
stringsAsFactors = FALSE)
head(BrainBody)
## body brain name
## 1 1.35 8.1 Mountain beaver
## 2 465.00 423.0 Cow
## 3 36.33 119.5 Grey wolf
## 4 27.66 115.0 Goat
## 5 1.04 5.5 Guinea pig
## 6 11700.00 50.0 Dipliodocus
A third file, BrainAndBody.txt, contains the same data, but also contains a few lines of explanatory text above the names of the variables. It also uses whitespace rather than a comma or a tab as a separator.
u.bb <- "http://blue.for.msu.edu/FOR875/data/BrainAndBody.txt"
BrainBody3 <- read.table(u.bb, header = TRUE, sep = " ", stringsAsFactors = FALSE, skip = 4)
BrainBody3[1:10,]
## body brain name
## 1 1.35 8.1 Mountain beaver
## 2 465.00 423.0 Cow
## 3 36.33 119.5 Grey wolf
## 4 27.66 115.0 Goat
## 5 1.04 5.5 Guinea pig
## 6 11700.00 50.0 Dipliodocus
## 7 2547.00 4603.0 Asian elephant
## 8 187.10 419.0 Donkey
## 9 521.00 655.0 Horse
## 10 10.00 115.0 Potar monkey
The read.table() function has an argument na.strings which allows the user to specify how missing data are indicated in the source file.
The argument na.strings = "" tells R that the file indicates missing data by leaving the appropriate entry blank.
u.weather <- "http://blue.for.msu.edu/FOR875/data/WeatherKLAN2014.csv"
WeatherKLAN2014 <- read.csv(u.weather, header = TRUE, stringsAsFactors = FALSE, na.strings = "")
WeatherKLAN2014[1:15,]
## EST Max.TemperatureF Min.TemperatureF Events
## 1 1/1/14 14 9 Snow
## 2 1/2/14 13 -3 Snow
## 3 1/3/14 13 -11 Snow
## 4 1/4/14 31 13 Snow
## 5 1/5/14 29 16 Fog-Snow
## 6 1/6/14 16 -12 Fog-Snow
## 7 1/7/14 2 -13 Snow
## 8 1/8/14 17 -1 Snow
## 9 1/9/14 21 2 Snow
## 10 1/10/14 39 21 Fog-Rain-Snow
## 11 1/11/14 41 32 Fog-Rain
## 12 1/12/14 39 31 <NA>
## 13 1/13/14 44 34 Rain
## 14 1/14/14 37 26 Rain-Snow
## 15 1/15/14 27 18 Snow
Some common data tasks include variable summaries such as means or standard deviations, transforming an existing variable, and creating new variables.
u.weather <- "http://blue.for.msu.edu/FOR875/data/WeatherKLAN2014Full.csv"
WeatherKLAN2014Full <- read.csv(u.weather, header = TRUE, stringsAsFactors = FALSE, na.strings = "")
names(WeatherKLAN2014Full)
## [1] "EST" "Max.TemperatureF"
## [3] "Mean.TemperatureF" "Min.TemperatureF"
## [5] "Max.Dew.PointF" "MeanDew.PointF"
## [7] "Min.DewpointF" "Max.Humidity"
## [9] "Mean.Humidity" "Min.Humidity"
## [11] "Max.Sea.Level.PressureIn" "Mean.Sea.Level.PressureIn"
## [13] "Min.Sea.Level.PressureIn" "Max.VisibilityMiles"
## [15] "Mean.VisibilityMiles" "Min.VisibilityMiles"
## [17] "Max.Wind.SpeedMPH" "Mean.Wind.SpeedMPH"
## [19] "Max.Gust.SpeedMPH" "PrecipitationIn"
## [21] "CloudCover" "Events"
## [23] "WindDirDegrees"
To compute the mean of each column, we can call mean() on each column separately:
mean(WeatherKLAN2014Full$Mean.TemperatureF)
## [1] 45.78082
mean(WeatherKLAN2014Full$Min.TemperatureF)
## [1] 36.25479
mean(WeatherKLAN2014Full$Max.TemperatureF)
## [1] 54.83836
## Et Cetera
However, this wastes a lot of time and typing. We can instead use the colMeans() function, which computes the mean of each (or specified) column of a data frame in one call.
colMeans(WeatherKLAN2014Full[, c(2:19, 21, 23)])
## Max.TemperatureF Mean.TemperatureF
## 54.838356 45.780822
## Min.TemperatureF Max.Dew.PointF
## 36.254795 41.800000
## MeanDew.PointF Min.DewpointF
## 36.394521 30.156164
## Max.Humidity Mean.Humidity
## 88.082192 70.391781
## Min.Humidity Max.Sea.Level.PressureIn
## 52.200000 30.130247
## Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn
## 30.015370 29.903890
## Max.VisibilityMiles Mean.VisibilityMiles
## 9.895890 8.249315
## Min.VisibilityMiles Max.Wind.SpeedMPH
## 4.824658 19.101370
## Mean.Wind.SpeedMPH Max.Gust.SpeedMPH
## 8.679452 NA
## CloudCover WindDirDegrees
## 4.367123 205.000000
R also has the functions rowMeans(), colSums(), and rowSums(). More generally, the apply() function applies a user-chosen function to either the rows or columns (or both) of a data frame. The arguments are X, the data frame of interest; MARGIN, specifying either rows (MARGIN = 1) or columns (MARGIN = 2); and FUN, the function to be applied.
apply(X = WeatherKLAN2014Full[, c(2:19, 21, 23)], MARGIN = 2, FUN = sd)
## Max.TemperatureF Mean.TemperatureF
## 22.2129716 20.9729330
## Min.TemperatureF Max.Dew.PointF
## 20.2596676 19.5167159
## MeanDew.PointF Min.DewpointF
## 20.0311014 20.8511271
## Max.Humidity Mean.Humidity
## 8.1909784 9.3660269
## Min.Humidity Max.Sea.Level.PressureIn
## 13.9461681 0.2032259
## Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn
## 0.2159484 0.2359542
## Max.VisibilityMiles Mean.VisibilityMiles
## 0.5790382 2.1059259
## Min.VisibilityMiles Max.Wind.SpeedMPH
## 3.8168143 6.4831246
## Mean.Wind.SpeedMPH Max.Gust.SpeedMPH
## 3.8862592 NA
## CloudCover WindDirDegrees
## 2.7798359 90.0673130
The with() function tells R that we are working with a particular data frame, so we don't need to keep typing the name of the data frame.
with(WeatherKLAN2014Full, mean(Max.TemperatureF[CloudCover < 4 & Max.Humidity > 85]))
## [1] 69.39241
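The same idea as a self-contained sketch, using a toy data frame (the temp and clouds columns are hypothetical, not the weather data):

```r
df <- data.frame(temp = c(60, 70, 80), clouds = c(2, 5, 1))
# mean temperature on days with cloud cover below 4
with(df, mean(temp[clouds < 4]))
## [1] 70
```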
m <- matrix(1:9, nrow = 3, ncol = 3)
m
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
To calculate the sum of each column we could do:
sum(m[,1])
## [1] 6
sum(m[,2])
## [1] 15
sum(m[,3])
## [1] 24
## or
colSums(m)
## [1] 6 15 24
A better way is to use the apply() function. Its syntax is apply(X, MARGIN, FUN), where X is the data, MARGIN specifies how to apply the function (1 = rows, 2 = columns), and FUN is the function to apply.
apply(X = m, MARGIN = 2, FUN = sum)
## [1] 6 15 24
What if your function requires multiple arguments? Pass the extra arguments after FUN. For example, consider m and z where z = c(1, 2, 3), and suppose we want to compute the correlation between z and each column of m using cor(x, y). The call apply(m, 2, cor, y = z) passes y = z along to cor() for each column. Also note that if MARGIN = c(1, 2), the function is applied element-wise, over both rows and columns.
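The multi-argument example can be run as a self-contained sketch:

```r
m <- matrix(1:9, nrow = 3)
z <- c(1, 2, 3)
# y = z is passed along to cor() for every column of m
apply(m, 2, cor, y = z)
## [1] 1 1 1
```

Each column of m is a linear function of z, so every correlation is exactly 1.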
lapply(), sapply(), and tapply(): if you have a list and want to apply a function to each element of the list, use lapply(X, FUN, ...), which returns a list, or sapply(X, FUN, ...), which simplifies the result to a vector or matrix. To process data by groups, use tapply(X, INDEX, FUN, ...), where INDEX is a grouping factor whose levels define the groups.
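A short self-contained sketch of lapply(), sapply(), and tapply() on toy data:

```r
lst <- list(a = 1:3, b = 4:6)
lapply(lst, mean)   # returns a list
sapply(lst, mean)   # simplifies to a named vector
## a b
## 2 5
scores <- c(10, 20, 30, 40)
grp <- factor(c("x", "y", "x", "y"))
tapply(scores, grp, mean)  # mean score per group
##  x  y
## 20 30
```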
Commonly variables are added to, removed from, changed in, or rearranged in, a data frame.
library(gapminder)
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
The data frame contains per capita GDP and population, and it might be interesting to create a variable that gives the total GDP by multiplying these two variables.
gapminder$TotalGDP <- gapminder$gdpPercap * gapminder$pop
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 7 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
## $ TotalGDP : num 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
Analogous to the with() function, there is a function within() which can simplify the syntax. Whereas with() does not change the data frame, within() can. Note that below I first remove the altered gapminder data frame using rm(), then bring a clean copy back in by reloading the gapminder package.
rm(gapminder)
library(gapminder)
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
gapminder <- within(gapminder, TotalGDP <- gdpPercap * pop)
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 7 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
## $ TotalGDP : num 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
A nice feature of within() is its ability to add more than one variable at a time to a data frame. In this case the two or more formulas creating new variables must be enclosed in braces.
gapminder <- within(gapminder, {TotalGDP <- gdpPercap * pop
lifeExpMonths <- lifeExp * 12})
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 8 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap : num 779 821 853 836 740 ...
## $ TotalGDP : num 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
## $ lifeExpMonths: num 346 364 384 408 433 ...
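Base R's transform() function offers similar one-step syntax for adding variables. A small sketch with a toy data frame (hypothetical columns, not the gapminder data):

```r
df <- data.frame(pop = 1:3, gdpPercap = c(10, 20, 30))
# transform() evaluates the expression inside the data frame, like within()
df <- transform(df, TotalGDP = gdpPercap * pop)
df$TotalGDP
## [1] 10 40 90
```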
After reflection we may realize the new variables we added to the gapminder data frame are not useful, and should be removed.
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 8 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap : num 779 821 853 836 740 ...
## $ TotalGDP : num 6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
## $ lifeExpMonths: num 346 364 384 408 433 ...
gapminder <- gapminder[1:6]
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
The same result could be obtained via gapminder <- gapminder[, 1:6].
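Columns can also be dropped by name rather than by position, which is safer if the column order ever changes. A sketch with a toy data frame:

```r
df <- data.frame(a = 1:2, b = 3:4, TotalGDP = 5:6)
# keep only the columns whose names are not in the drop list
df <- df[, !(names(df) %in% c("TotalGDP"))]
names(df)
## [1] "a" "b"
```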
a <- data.frame(x = 1:3, y = c("dog", "cat", "pig"), z = seq(from = 1, to = 2, length = 3))
a
## x y z
## 1 1 dog 1.0
## 2 2 cat 1.5
## 3 3 pig 2.0
b <- a[1]
b
## x
## 1 1
## 2 2
## 3 3
b <- a[, 1]
b
## [1] 1 2 3
c <- a[-(2:3)]
c
## x
## 1 1
## 2 2
## 3 3
#d <- a[-2:3]  # error: mixes negative and positive subscripts
One can also use a negative sign in front of the variable number(s). For example, a[-(2:3)] drops the second and third columns of a, leaving only x. Note that a[-2:3] does not work: -2:3 expands to -2, -1, 0, 1, 2, 3, which mixes negative and positive subscripts, and R does not allow that.
Imagine we want to modify the existing lifeExp variable to measure life expectancy in months.
rm(gapminder)
library(gapminder)
gapminder$lifeExp[1:5]
## [1] 28.801 30.332 31.997 34.020 36.088
gapminder$lifeExp <- gapminder$lifeExp * 12
gapminder$lifeExp[1:5]
## [1] 345.612 363.984 383.964 408.240 433.056
rm(gapminder)
library(gapminder)
gapminder$lifeExp[1:5]
## [1] 28.801 30.332 31.997 34.020 36.088
gapminder <- within(gapminder, lifeExp <- lifeExp * 12)
gapminder$lifeExp[1:5]
## [1] 345.612 363.984 383.964 408.240 433.056
Consider the full weather data set again.
u.weather <- "http://blue.for.msu.edu/FOR875/data/WeatherKLAN2014Full.csv"
WeatherKLAN2014Full <- read.csv(u.weather, header=TRUE,
stringsAsFactors = FALSE,
                                na.strings = "")
names(WeatherKLAN2014Full)
## [1] "EST" "Max.TemperatureF"
## [3] "Mean.TemperatureF" "Min.TemperatureF"
## [5] "Max.Dew.PointF" "MeanDew.PointF"
## [7] "Min.DewpointF" "Max.Humidity"
## [9] "Mean.Humidity" "Min.Humidity"
## [11] "Max.Sea.Level.PressureIn" "Mean.Sea.Level.PressureIn"
## [13] "Min.Sea.Level.PressureIn" "Max.VisibilityMiles"
## [15] "Mean.VisibilityMiles" "Min.VisibilityMiles"
## [17] "Max.Wind.SpeedMPH" "Mean.Wind.SpeedMPH"
## [19] "Max.Gust.SpeedMPH" "PrecipitationIn"
## [21] "CloudCover" "Events"
## [23] "WindDirDegrees"
If we want the wind speed variables to come right after the date, we can again use subsetting.
WeatherKLAN2014Full <- WeatherKLAN2014Full[c(1, 17, 18, 19, 2:16, 20:23)]
names(WeatherKLAN2014Full)
## [1] "EST" "Max.Wind.SpeedMPH"
## [3] "Mean.Wind.SpeedMPH" "Max.Gust.SpeedMPH"
## [5] "Max.TemperatureF" "Mean.TemperatureF"
## [7] "Min.TemperatureF" "Max.Dew.PointF"
## [9] "MeanDew.PointF" "Min.DewpointF"
## [11] "Max.Humidity" "Mean.Humidity"
## [13] "Min.Humidity" "Max.Sea.Level.PressureIn"
## [15] "Mean.Sea.Level.PressureIn" "Min.Sea.Level.PressureIn"
## [17] "Max.VisibilityMiles" "Mean.VisibilityMiles"
## [19] "Min.VisibilityMiles" "PrecipitationIn"
## [21] "CloudCover" "Events"
## [23] "WindDirDegrees"
A data set can be represented in several different formats, most commonly wide or long.
The R package tidyr has functions for converting data between these formats. In a tidy dataset: (1) each variable has its own column, (2) each observation has its own row, and (3) each value has its own cell.
Useful tidyr functions include gather(), spread(), separate(), and unite().
u.rel <- "http://blue.for.msu.edu/FOR875/data/religion2.csv"
religion <- read.csv(u.rel, header = TRUE, stringsAsFactors = FALSE)
head(religion)
## religion under10k btw10and20k btw20and30k btw30and40k
## 1 Agnostic 27 34 60 81
## 2 Atheist 12 27 37 52
## 3 Buddhist 27 21 30 34
## 4 Catholic 418 617 732 670
## 5 DoNotKnowOrRefused 15 14 15 11
## 6 EvangelicalProt 575 869 1064 982
## btw40and50k btw50and75k btw75and100k btw100and150k over150k
## 1 76 137 122 109 84
## 2 35 70 73 59 74
## 3 33 58 62 39 53
## 4 638 1116 949 792 633
## 5 10 35 21 17 18
## 6 881 1486 949 723 414
## DoNotKnowOrRefused
## 1 96
## 2 76
## 3 54
## 4 1489
## 5 116
## 6 1529
Table 1:
library(tidyr)
table1
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Table 2:
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
Table 3:
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
Table 4:
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
The gather() function transforms data from wide to long format. It is useful when a variable is spread across multiple columns.
To use gather() we specify the data frame (data = religion), the name we want to give to the column created from the income levels (key = IncomeLevel), the name we want to give to the column containing the frequency values (value = Frequency), and the columns to gather (2:11).
library(tidyr)
religionLong <- gather(data = religion, key = IncomeLevel, value = Frequency, 2:11)
head(religionLong)
## religion IncomeLevel Frequency
## 1 Agnostic under10k 27
## 2 Atheist under10k 12
## 3 Buddhist under10k 27
## 4 Catholic under10k 418
## 5 DoNotKnowOrRefused under10k 15
## 6 EvangelicalProt under10k 575
tail(religionLong)
## religion IncomeLevel Frequency
## 175 Muslim DoNotKnowOrRefused 22
## 176 Orthodox DoNotKnowOrRefused 73
## 177 OtherChristian DoNotKnowOrRefused 18
## 178 OtherFaiths DoNotKnowOrRefused 71
## 179 OtherWorldReligions DoNotKnowOrRefused 8
## 180 Unaffiliated DoNotKnowOrRefused 597
When an observation is scattered across multiple rows, we can use spread(), which changes the data from long to wide format.
religionWide <- spread(data = religionLong, key = IncomeLevel, value = Frequency)
head(religionWide)
## religion btw100and150k btw10and20k btw20and30k btw30and40k
## 1 Agnostic 109 34 60 81
## 2 Atheist 59 27 37 52
## 3 Buddhist 39 21 30 34
## 4 Catholic 792 617 732 670
## 5 DoNotKnowOrRefused 17 14 15 11
## 6 EvangelicalProt 723 869 1064 982
## btw40and50k btw50and75k btw75and100k DoNotKnowOrRefused over150k
## 1 76 137 122 96 84
## 2 35 70 73 76 74
## 3 33 58 62 54 53
## 4 638 1116 949 1489 633
## 5 10 35 21 116 18
## 6 881 1486 949 1529 414
## under10k
## 1 27
## 2 12
## 3 27
## 4 418
## 5 15
## 6 575
Here we specify the data frame (religionLong), the column (IncomeLevel) to be spread, and the column of values (Frequency) to be spread among the newly created columns.
tidyr provides two other useful functions to separate and unite variables based on a delimiter. Consider the yearlyIncomeLong table created below. Say we want to split the name variable into first and last name. This can be done using the separate() function, which is useful when cells hold multiple values.
yearlyIncomeLong <- data.frame(name = c("John Son", "Jim Doe", "Sam Nil"), y = c("dog", "cat", "pig"), z = seq(from = 1, to = 2, length = 3))
firstLast <- separate(data = yearlyIncomeLong, col = name, into = c("first", "last"), sep="\\s")
print(firstLast)
## first last y z
## 1 John Son dog 1.0
## 2 Jim Doe cat 1.5
## 3 Sam Nil pig 2.0
To combine the first and last names again, use the unite() function. It is useful when a single value is split across multiple cells.
unite(firstLast, col = name, first, last, sep = "_")
## name y z
## 1 John_Son dog 1.0
## 2 Jim_Doe cat 1.5
## 3 Sam_Nil pig 2.0
separate(data = table3, col = "rate", into = c("cases", "population"))
## # A tibble: 6 x 4
## country year cases population
## * <chr> <int> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
gather(table4a, key = "year", value = "cases", -1)
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
mutate() adds new variables that are functions of existing variables; select() picks variables based on their names; filter() picks cases based on their values; summarize() reduces multiple values down to a single summary; arrange() changes the ordering of the rows.
These all combine naturally with a group_by() that allows you to perform any operation grouped by values of one or more variables.
tbl_df() creates a tibble.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
head(religionWide)
## religion btw100and150k btw10and20k btw20and30k btw30and40k
## 1 Agnostic 109 34 60 81
## 2 Atheist 59 27 37 52
## 3 Buddhist 39 21 30 34
## 4 Catholic 792 617 732 670
## 5 DoNotKnowOrRefused 17 14 15 11
## 6 EvangelicalProt 723 869 1064 982
## btw40and50k btw50and75k btw75and100k DoNotKnowOrRefused over150k
## 1 76 137 122 96 84
## 2 35 70 73 76 74
## 3 33 58 62 54 53
## 4 638 1116 949 1489 633
## 5 10 35 21 116 18
## 6 881 1486 949 1529 414
## under10k
## 1 27
## 2 12
## 3 27
## 4 418
## 5 15
## 6 575
religionWideTbl <- tbl_df(religionWide)
head(religionWideTbl)
## # A tibble: 6 x 11
## religion btw100and150k btw10and20k btw20and30k btw30and40k btw40and50k
## <chr> <int> <int> <int> <int> <int>
## 1 Agnostic 109 34 60 81 76
## 2 Atheist 59 27 37 52 35
## 3 Buddhist 39 21 30 34 33
## 4 Catholic 792 617 732 670 638
## 5 DoNotKn… 17 14 15 11 10
## 6 Evangel… 723 869 1064 982 881
## # ... with 5 more variables: btw50and75k <int>, btw75and100k <int>,
## # DoNotKnowOrRefused <int>, over150k <int>, under10k <int>
The read.delim() function defaults to header = TRUE.
The filter() function selects rows based on criteria involving the column variables.
u.gm <- "http://blue.for.msu.edu/FOR875/data/gapminder.tsv"
gm <- read.delim(u.gm)
gm <- tbl_df(gm)
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 346 364 384 408 433 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
head(gm)
## # A tibble: 6 x 6
## country year pop continent lifeExp gdpPercap
## <fct> <int> <dbl> <fct> <dbl> <dbl>
## 1 Afghanistan 1952 8425333 Asia 28.8 779.
## 2 Afghanistan 1957 9240934 Asia 30.3 821.
## 3 Afghanistan 1962 10267083 Asia 32.0 853.
## 4 Afghanistan 1967 11537966 Asia 34.0 836.
## 5 Afghanistan 1972 13079460 Asia 36.1 740.
## 6 Afghanistan 1977 14880372 Asia 38.4 786.
Filtering helps us to examine subsets of the data such as data from a particular country or from several specified countries, data from certain years, from countries with certain populations, etc. Some examples:
filter(gm, country == "Brazil")
## # A tibble: 12 x 6
## country year pop continent lifeExp gdpPercap
## <fct> <int> <dbl> <fct> <dbl> <dbl>
## 1 Brazil 1952 56602560 Americas 50.9 2109.
## 2 Brazil 1957 65551171 Americas 53.3 2487.
## 3 Brazil 1962 76039390 Americas 55.7 3337.
## 4 Brazil 1967 88049823 Americas 57.6 3430.
## 5 Brazil 1972 100840058 Americas 59.5 4986.
## 6 Brazil 1977 114313951 Americas 61.5 6660.
## 7 Brazil 1982 128962939 Americas 63.3 7031.
## 8 Brazil 1987 142938076 Americas 65.2 7807.
## 9 Brazil 1992 155975974 Americas 67.1 6950.
## 10 Brazil 1997 168546719 Americas 69.4 7958.
## 11 Brazil 2002 179914212 Americas 71.0 8131.
## 12 Brazil 2007 190010647 Americas 72.4 9066.
filter(gm, country %in% c("Brazil", "Mexico"))
## # A tibble: 24 x 6
## country year pop continent lifeExp gdpPercap
## <fct> <int> <dbl> <fct> <dbl> <dbl>
## 1 Brazil 1952 56602560 Americas 50.9 2109.
## 2 Brazil 1957 65551171 Americas 53.3 2487.
## 3 Brazil 1962 76039390 Americas 55.7 3337.
## 4 Brazil 1967 88049823 Americas 57.6 3430.
## 5 Brazil 1972 100840058 Americas 59.5 4986.
## 6 Brazil 1977 114313951 Americas 61.5 6660.
## 7 Brazil 1982 128962939 Americas 63.3 7031.
## 8 Brazil 1987 142938076 Americas 65.2 7807.
## 9 Brazil 1992 155975974 Americas 67.1 6950.
## 10 Brazil 1997 168546719 Americas 69.4 7958.
## # ... with 14 more rows
filter(gm, country %in% c("Brazil", "Mexico") & year %in% c(1952, 1972))
## # A tibble: 4 x 6
## country year pop continent lifeExp gdpPercap
## <fct> <int> <dbl> <fct> <dbl> <dbl>
## 1 Brazil 1952 56602560 Americas 50.9 2109.
## 2 Brazil 1972 100840058 Americas 59.5 4986.
## 3 Mexico 1952 30144317 Americas 50.8 3478.
## 4 Mexico 1972 55984294 Americas 62.4 6809.
filter(gm, pop > 300000000)
## # A tibble: 25 x 6
## country year pop continent lifeExp gdpPercap
## <fct> <int> <dbl> <fct> <dbl> <dbl>
## 1 China 1952 556263528. Asia 44 400.
## 2 China 1957 637408000 Asia 50.5 576.
## 3 China 1962 665770000 Asia 44.5 488.
## 4 China 1967 754550000 Asia 58.4 613.
## 5 China 1972 862030000 Asia 63.1 677.
## 6 China 1977 943455000 Asia 64.0 741.
## 7 China 1982 1000281000 Asia 65.5 962.
## 8 China 1987 1084035000 Asia 67.3 1379.
## 9 China 1992 1164970000 Asia 68.7 1656.
## 10 China 1997 1230075000 Asia 70.4 2289.
## # ... with 15 more rows
filter(gm, pop > 300000000 & year == 2007)
## # A tibble: 3 x 6
## country year pop continent lifeExp gdpPercap
## <fct> <int> <dbl> <fct> <dbl> <dbl>
## 1 China 2007 1318683096 Asia 73.0 4959.
## 2 India 2007 1110396331 Asia 64.7 2452.
## 3 United States 2007 301139947 Americas 78.2 42952.
The select() function chooses variables (columns) of a data frame.
Variables can be selected by name or column number. As usual, a negative sign tells R to leave something out, and there are special helper functions such as starts_with() that match part of a variable's name.
select(gm, country, year, lifeExp)
## # A tibble: 1,704 x 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Afghanistan 1952 28.8
## 2 Afghanistan 1957 30.3
## 3 Afghanistan 1962 32.0
## 4 Afghanistan 1967 34.0
## 5 Afghanistan 1972 36.1
## 6 Afghanistan 1977 38.4
## 7 Afghanistan 1982 39.9
## 8 Afghanistan 1987 40.8
## 9 Afghanistan 1992 41.7
## 10 Afghanistan 1997 41.8
## # ... with 1,694 more rows
select(gm, 2:4)
## # A tibble: 1,704 x 3
## year pop continent
## <int> <dbl> <fct>
## 1 1952 8425333 Asia
## 2 1957 9240934 Asia
## 3 1962 10267083 Asia
## 4 1967 11537966 Asia
## 5 1972 13079460 Asia
## 6 1977 14880372 Asia
## 7 1982 12881816 Asia
## 8 1987 13867957 Asia
## 9 1992 16317921 Asia
## 10 1997 22227415 Asia
## # ... with 1,694 more rows
select(gm, -c(2, 3, 4))
## # A tibble: 1,704 x 3
## country lifeExp gdpPercap
## <fct> <dbl> <dbl>
## 1 Afghanistan 28.8 779.
## 2 Afghanistan 30.3 821.
## 3 Afghanistan 32.0 853.
## 4 Afghanistan 34.0 836.
## 5 Afghanistan 36.1 740.
## 6 Afghanistan 38.4 786.
## 7 Afghanistan 39.9 978.
## 8 Afghanistan 40.8 852.
## 9 Afghanistan 41.7 649.
## 10 Afghanistan 41.8 635.
## # ... with 1,694 more rows
select(gm, starts_with("c"))
## # A tibble: 1,704 x 2
## country continent
## <fct> <fct>
## 1 Afghanistan Asia
## 2 Afghanistan Asia
## 3 Afghanistan Asia
## 4 Afghanistan Asia
## 5 Afghanistan Asia
## 6 Afghanistan Asia
## 7 Afghanistan Asia
## 8 Afghanistan Asia
## 9 Afghanistan Asia
## 10 Afghanistan Asia
## # ... with 1,694 more rows
Consider selecting the country, year, and population for countries in Asia or Europe. One possibility is to nest a filter() function inside a select() function.
select(filter(gm, continent %in% c("Asia", "Europe")), country, year, pop)
## # A tibble: 756 x 3
## country year pop
## <fct> <int> <dbl>
## 1 Afghanistan 1952 8425333
## 2 Afghanistan 1957 9240934
## 3 Afghanistan 1962 10267083
## 4 Afghanistan 1967 11537966
## 5 Afghanistan 1972 13079460
## 6 Afghanistan 1977 14880372
## 7 Afghanistan 1982 12881816
## 8 Afghanistan 1987 13867957
## 9 Afghanistan 1992 16317921
## 10 Afghanistan 1997 22227415
## # ... with 746 more rows
There is a nice feature in dplyr that allows us to feed the results of one function into the first argument of a subsequent function: the %>% operator does the piping.
gm %>% filter(continent %in% c("Asia", "Europe")) %>% select(country, year, pop)
## # A tibble: 756 x 3
## country year pop
## <fct> <int> <dbl>
## 1 Afghanistan 1952 8425333
## 2 Afghanistan 1957 9240934
## 3 Afghanistan 1962 10267083
## 4 Afghanistan 1967 11537966
## 5 Afghanistan 1972 13079460
## 6 Afghanistan 1977 14880372
## 7 Afghanistan 1982 12881816
## 8 Afghanistan 1987 13867957
## 9 Afghanistan 1992 16317921
## 10 Afghanistan 1997 22227415
## # ... with 746 more rows
By default the gapminder data are arranged by country and then by year. The arrange() function reorders the rows.
head(gm, 10)
## # A tibble: 10 x 6
## country year pop continent lifeExp gdpPercap
## <fct> <int> <dbl> <fct> <dbl> <dbl>
## 1 Afghanistan 1952 8425333 Asia 28.8 779.
## 2 Afghanistan 1957 9240934 Asia 30.3 821.
## 3 Afghanistan 1962 10267083 Asia 32.0 853.
## 4 Afghanistan 1967 11537966 Asia 34.0 836.
## 5 Afghanistan 1972 13079460 Asia 36.1 740.
## 6 Afghanistan 1977 14880372 Asia 38.4 786.
## 7 Afghanistan 1982 12881816 Asia 39.9 978.
## 8 Afghanistan 1987 13867957 Asia 40.8 852.
## 9 Afghanistan 1992 16317921 Asia 41.7 649.
## 10 Afghanistan 1997 22227415 Asia 41.8 635.
Possibly arranging the data by year and then country would be desired. The arrange() function makes this easy. We will again use pipes.
gm %>% arrange(year, country)
## # A tibble: 1,704 x 6
## country year pop continent lifeExp gdpPercap
## <fct> <int> <dbl> <fct> <dbl> <dbl>
## 1 Afghanistan 1952 8425333 Asia 28.8 779.
## 2 Albania 1952 1282697 Europe 55.2 1601.
## 3 Algeria 1952 9279525 Africa 43.1 2449.
## 4 Angola 1952 4232095 Africa 30.0 3521.
## 5 Argentina 1952 17876956 Americas 62.5 5911.
## 6 Australia 1952 8691212 Oceania 69.1 10040.
## 7 Austria 1952 6927772 Europe 66.8 6137.
## 8 Bahrain 1952 120447 Asia 50.9 9867.
## 9 Bangladesh 1952 46886859 Asia 37.5 684.
## 10 Belgium 1952 8730405 Europe 68 8343.
## # ... with 1,694 more rows
How about the data for Rwanda, arranged in order of life expectancy?
gm %>% filter(country == "Rwanda") %>% arrange(lifeExp)
## # A tibble: 12 x 6
## country year pop continent lifeExp gdpPercap
## <fct> <int> <dbl> <fct> <dbl> <dbl>
## 1 Rwanda 1992 7290203 Africa 23.6 737.
## 2 Rwanda 1997 7212583 Africa 36.1 590.
## 3 Rwanda 1952 2534927 Africa 40 493.
## 4 Rwanda 1957 2822082 Africa 41.5 540.
## 5 Rwanda 1962 3051242 Africa 43 597.
## 6 Rwanda 2002 7852401 Africa 43.4 786.
## 7 Rwanda 1987 6349365 Africa 44.0 848.
## 8 Rwanda 1967 3451079 Africa 44.1 511.
## 9 Rwanda 1972 3992121 Africa 44.6 591.
## 10 Rwanda 1977 4657072 Africa 45 670.
## 11 Rwanda 1982 5507565 Africa 46.2 882.
## 12 Rwanda 2007 8860588 Africa 46.2 863.
Possibly we want these data to be in decreasing (descending) order. Here, desc() is one of many dplyr helper functions.
gm %>% filter(country == "Rwanda") %>% arrange(desc(lifeExp))
## # A tibble: 12 x 6
## country year pop continent lifeExp gdpPercap
## <fct> <int> <dbl> <fct> <dbl> <dbl>
## 1 Rwanda 2007 8860588 Africa 46.2 863.
## 2 Rwanda 1982 5507565 Africa 46.2 882.
## 3 Rwanda 1977 4657072 Africa 45 670.
## 4 Rwanda 1972 3992121 Africa 44.6 591.
## 5 Rwanda 1967 3451079 Africa 44.1 511.
## 6 Rwanda 1987 6349365 Africa 44.0 848.
## 7 Rwanda 2002 7852401 Africa 43.4 786.
## 8 Rwanda 1962 3051242 Africa 43 597.
## 9 Rwanda 1957 2822082 Africa 41.5 540.
## 10 Rwanda 1952 2534927 Africa 40 493.
## 11 Rwanda 1997 7212583 Africa 36.1 590.
## 12 Rwanda 1992 7290203 Africa 23.6 737.
Possibly we want to include only the year and life expectancy, to make the message more stark.
gm %>% filter(country == "Rwanda") %>% select(year, lifeExp) %>% arrange(desc(lifeExp))
## # A tibble: 12 x 2
## year lifeExp
## <int> <dbl>
## 1 2007 46.2
## 2 1982 46.2
## 3 1977 45
## 4 1972 44.6
## 5 1967 44.1
## 6 1987 44.0
## 7 2002 43.4
## 8 1962 43
## 9 1957 41.5
## 10 1952 40
## 11 1997 36.1
## 12 1992 23.6
The dplyr package has a rename() function that makes renaming variables in a data frame quite easy.
gm <- rename(gm, population = pop)
head(gm)
## # A tibble: 6 x 6
## country year population continent lifeExp gdpPercap
## <fct> <int> <dbl> <fct> <dbl> <dbl>
## 1 Afghanistan 1952 8425333 Asia 28.8 779.
## 2 Afghanistan 1957 9240934 Asia 30.3 821.
## 3 Afghanistan 1962 10267083 Asia 32.0 853.
## 4 Afghanistan 1967 11537966 Asia 34.0 836.
## 5 Afghanistan 1972 13079460 Asia 36.1 740.
## 6 Afghanistan 1977 14880372 Asia 38.4 786.
The summarize() function computes summary statistics, or applies a user-provided function, to one or more columns of a data frame.
summarize(gm, meanpop = mean(population), medpop = median(population))
## # A tibble: 1 x 2
## meanpop medpop
## <dbl> <dbl>
## 1 29601212. 7023596.
For example, we might want the median life expectancy for each continent separately. One option is subsetting:
median(gm$lifeExp[gm$continent == "Africa"])
## [1] 47.792
median(gm$lifeExp[gm$continent == "Asia"])
## [1] 61.7915
median(gm$lifeExp[gm$continent == "Europe"])
## [1] 72.241
median(gm$lifeExp[gm$continent == "Americas"])
## [1] 67.048
median(gm$lifeExp[gm$continent == "Oceania"])
## [1] 73.665
The group_by() function makes this easier, and makes the output more useful.
gm %>% group_by(continent) %>% summarize(medLifeExp = median(lifeExp))
## # A tibble: 5 x 2
## continent medLifeExp
## <fct> <dbl>
## 1 Africa 47.8
## 2 Americas 67.0
## 3 Asia 61.8
## 4 Europe 72.2
## 5 Oceania 73.7
Or if we want the results ordered by the median life expectancy:
gm %>% group_by(continent) %>% summarize(medLifeExp = median(lifeExp)) %>% arrange(medLifeExp)
## # A tibble: 5 x 2
## continent medLifeExp
## <fct> <dbl>
## 1 Africa 47.8
## 2 Asia 61.8
## 3 Americas 67.0
## 4 Europe 72.2
## 5 Oceania 73.7
We can calculate the number of observations we have per continent using the n() helper function:
gm %>% group_by(continent) %>% summarize(numObs = n())
## # A tibble: 5 x 2
## continent numObs
## <fct> <int>
## 1 Africa 624
## 2 Americas 300
## 3 Asia 396
## 4 Europe 360
## 5 Oceania 24
Adding the n_distinct() helper also counts the distinct countries per continent:
gm %>% group_by(continent) %>% summarize(n_obs = n(), n_countries = n_distinct(country))
## # A tibble: 5 x 3
## continent n_obs n_countries
## <fct> <int> <int>
## 1 Africa 624 52
## 2 Americas 300 25
## 3 Asia 396 33
## 4 Europe 360 30
## 5 Oceania 24 2
Here is a bit more involved example that calculates the minimum and maximum life expectancies for countries in Africa by year.
gm %>% filter(continent == "Africa") %>% group_by(year) %>% summarize(min_lifeExp = min(lifeExp), max_lifeExp = max(lifeExp))
## # A tibble: 12 x 3
## year min_lifeExp max_lifeExp
## <int> <dbl> <dbl>
## 1 1952 30 52.7
## 2 1957 31.6 58.1
## 3 1962 32.8 60.2
## 4 1967 34.1 61.6
## 5 1972 35.4 64.3
## 6 1977 36.8 67.1
## 7 1982 38.4 69.9
## 8 1987 39.9 71.9
## 9 1992 23.6 73.6
## 10 1997 36.1 74.8
## 11 2002 39.2 75.7
## 12 2007 39.6 76.4
This is interesting, but the results don't include the countries that achieved the minimum and maximum life expectancies. First, we find the country with the minimum life expectancy in each year:
gm %>% select(country, continent, year, lifeExp) %>% group_by(year) %>% arrange(year) %>% filter(rank(lifeExp) == 1)
## # A tibble: 12 x 4
## # Groups: year [12]
## country continent year lifeExp
## <fct> <fct> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8
## 2 Afghanistan Asia 1957 30.3
## 3 Afghanistan Asia 1962 32.0
## 4 Afghanistan Asia 1967 34.0
## 5 Sierra Leone Africa 1972 35.4
## 6 Cambodia Asia 1977 31.2
## 7 Sierra Leone Africa 1982 38.4
## 8 Angola Africa 1987 39.9
## 9 Rwanda Africa 1992 23.6
## 10 Rwanda Africa 1997 36.1
## 11 Zambia Africa 2002 39.2
## 12 Swaziland Africa 2007 39.6
Next we add the country with the maximum life expectancy in each year:
gm %>% select(country, continent, year, lifeExp) %>% group_by(year) %>% arrange(year) %>% filter(rank(lifeExp) == 1 | rank(desc(lifeExp)) == 1) %>% print(n=24)
## # A tibble: 24 x 4
## # Groups: year [12]
## country continent year lifeExp
## <fct> <fct> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8
## 2 Norway Europe 1952 72.7
## 3 Afghanistan Asia 1957 30.3
## 4 Iceland Europe 1957 73.5
## 5 Afghanistan Asia 1962 32.0
## 6 Iceland Europe 1962 73.7
## 7 Afghanistan Asia 1967 34.0
## 8 Sweden Europe 1967 74.2
## 9 Sierra Leone Africa 1972 35.4
## 10 Sweden Europe 1972 74.7
## 11 Cambodia Asia 1977 31.2
## 12 Iceland Europe 1977 76.1
## 13 Japan Asia 1982 77.1
## 14 Sierra Leone Africa 1982 38.4
## 15 Angola Africa 1987 39.9
## 16 Japan Asia 1987 78.7
## 17 Japan Asia 1992 79.4
## 18 Rwanda Africa 1992 23.6
## 19 Japan Asia 1997 80.7
## 20 Rwanda Africa 1997 36.1
## 21 Japan Asia 2002 82
## 22 Zambia Africa 2002 39.2
## 23 Japan Asia 2007 82.6
## 24 Swaziland Africa 2007 39.6
The $ notation provides a simple way to create new variables in a data frame. The mutate() function provides another, sometimes cleaner, way to add a new variable to a data frame.
gm %>% group_by(country) %>% mutate(changeLifeExp = lifeExp - lag(lifeExp, order_by = year)) %>% select(-c(population, gdpPercap))
## # A tibble: 1,704 x 5
## # Groups: country [142]
## country year continent lifeExp changeLifeExp
## <fct> <int> <fct> <dbl> <dbl>
## 1 Afghanistan 1952 Asia 28.8 NA
## 2 Afghanistan 1957 Asia 30.3 1.53
## 3 Afghanistan 1962 Asia 32.0 1.66
## 4 Afghanistan 1967 Asia 34.0 2.02
## 5 Afghanistan 1972 Asia 36.1 2.07
## 6 Afghanistan 1977 Asia 38.4 2.35
## 7 Afghanistan 1982 Asia 39.9 1.42
## 8 Afghanistan 1987 Asia 40.8 0.968
## 9 Afghanistan 1992 Asia 41.7 0.852
## 10 Afghanistan 1997 Asia 41.8 0.0890
## # ... with 1,694 more rows
data("mtcars")
mtcars <- as_tibble(mtcars)
mtcars
## # A tibble: 32 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # ... with 22 more rows
filter(mtcars, cyl==6 & am ==1)
## # A tibble: 3 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
arrange(mtcars, hp)
## # A tibble: 32 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
## 2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 3 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
## 4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
## 5 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
## 6 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
## 7 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 8 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 9 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
## 10 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## # ... with 22 more rows
ratio_mtcars <- mutate(mtcars, ratio = hp/mpg)
ratio_mtcars
## # A tibble: 32 x 12
## mpg cyl disp hp drat wt qsec vs am gear carb ratio
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 5.24
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 5.24
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 4.08
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 5.14
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 9.36
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 5.80
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 17.1
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 2.54
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 4.17
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 6.41
## # ... with 22 more rows
group_by(mtcars, cyl) %>% summarise(mean.hp =mean(hp))
## # A tibble: 3 x 2
## cyl mean.hp
## <dbl> <dbl>
## 1 4 82.6
## 2 6 122.
## 3 8 209.
This chapter will provide a brief introduction to using R for simulation.
The purposes of simulation: 1. Test statistical intuition. 2. Validate theory. 3. Experiment with a new technique.
Generate random numbers from different distributions with rnorm, rbinom, rchisq, etc.
To generate from a finite population use sample(): sample(X, size, replace = FALSE, prob = NULL), where X is the vector of data to sample from.
The Monty Hall problem: Is it better to stay or switch doors? Or does it not matter?
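A minimal simulation sketch of the Monty Hall problem using sample(); the function monty_hall and the door numbering are our own choices, not from the text.

```r
monty_hall <- function(n_games, switch = TRUE) {
  wins <- 0
  for (i in 1:n_games) {
    prize  <- sample(1:3, 1)   # door hiding the car
    choice <- sample(1:3, 1)   # contestant's first pick
    # Host opens a goat door that is neither the prize nor the pick.
    remaining <- setdiff(1:3, c(prize, choice))
    # Guard: sample(n, 1) on a length-one vector n samples from 1:n.
    opened <- if (length(remaining) == 1) remaining else sample(remaining, 1)
    if (switch) choice <- setdiff(1:3, c(choice, opened))
    wins <- wins + (choice == prize)
  }
  wins / n_games
}
set.seed(875)
monty_hall(10000, switch = TRUE)   # close to 2/3
monty_hall(10000, switch = FALSE)  # close to 1/3
```

The simulation suggests switching wins about twice as often as staying, which answers the question above.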
Simulations use numbers generated by a pseudo-random number generator (PRNG) rather than by a truly random number generator.
Linear congruential generators are a common starting point. A linear congruential generator requires a multiplier (a), an increment (c), a modulus (m), and a seed X0, all integers. The generator is defined by
\(X_i = (aX_{i-1} + c) \bmod m\)
Recall that Z mod m gives the remainder when Z is divided by m. In R Z mod m is implemented as Z %% m.
15%%2
## [1] 1
15%%5
## [1] 0
15%%6
## [1] 3
The function lcg() below implements this idea in R.
lcg <- function(a, c, seed, m, iter) {
out <- numeric(iter)
out[1] <- seed
for (i in 2:(iter)) {
out[i] <- (a * out[i - 1] + c)%%m
}
return(out)
}
lcg(a = 65, c = 1, seed = 0, m = 2048, iter = 10)/2048
## [1] 0.0000000000 0.0004882812 0.0322265625 0.0952148438 0.1894531250
## [6] 0.3149414062 0.4716796875 0.6596679688 0.8789062500 0.1293945312
lcg(a = 1365, c = 1, seed = 0, m = 2048, iter = 10)/2048
## [1] 0.0000000000 0.0004882812 0.6669921875 0.4448242188 0.1855468750
## [6] 0.2719726562 0.2431640625 0.9194335938 0.0273437500 0.3247070312
lcg(a = 1229, c = 1, seed = 0, m = 2048, iter = 10)/2048
## [1] 0.0000000000 0.0004882812 0.6005859375 0.1206054688 0.2246093750
## [6] 0.0454101562 0.8095703125 0.9624023438 0.7929687500 0.5590820312
A very basic way to assess the performance of a PRNG is to draw a histogram of the values. The histogram should look like the histogram of uniformly distributed values.
tmp_random <- data.frame(x = lcg(a = 1229, c = 1, seed = 0, m = 2048, iter = 2048)/2048)
ggplot(tmp_random, aes(x = x)) + geom_histogram(binwidth = 0.1)
Being able to provide a seed is important for repeatability of simulations. The set.seed() function in R provides this capability
set.seed(123)
runif(5)
## [1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673
set.seed(123)
runif(5)
## [1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673
runif(5)
## [1] 0.0455565 0.5281055 0.8924190 0.5514350 0.4566147
For example suppose that we want to generate coin flips, i.e., to generate from a Bernoulli (0.5) distribution. Here is one way to do this.
generate_coin <- function(n) {
out <- rep(0, n)
out[runif(n) > 0.5] <- 1
return(out)
}
generate_coin(10)
## [1] 1 0 1 1 0 1 0 0 0 1
If we wanted H and T for heads and tails, a similar method would work.
generate_coin <- function(n) {
out <- rep("T", n)
out[runif(n) > 0.5] <- "H"
return(out)
}
generate_coin(10)
## [1] "H" "H" "H" "H" "H" "H" "H" "H" "T" "T"
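Equivalently, the built-in rbinom generates Bernoulli(0.5) flips directly; this short sketch (our own) relabels the 0/1 outcomes with ifelse.

```r
# Bernoulli(0.5) coin flips via rbinom, relabeled as "H"/"T"
flips <- rbinom(10, size = 1, prob = 0.5)
ifelse(flips == 1, "H", "T")
```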
R provides functions that give the density (or mass function), cumulative distribution function, quantile function, and generate values.
dnorm is the normal density function, pnorm is the normal cumulative distribution function, qnorm is the normal quantile function, and rnorm generates values from a normal distribution. Rather than designing our own methods for simulating from these distributions, it is better to use the built-in functions.
# 10 values from the N(100, 2) distribution
rnorm(10, mean = 100, sd = 2)
## [1] 103.57383 100.99570 96.06677 101.40271 99.05442 97.86435 99.56405
## [8] 97.94799 98.54222 98.74992
# 5 values from the Uniform(-3, 4) distribution
runif(5, min = -3, max = 4)
## [1] -2.67918183 0.09540052 2.59247392 -2.14670518 0.92663589
# 7 values from the binomial(10, 0.2) distribution
rbinom(7, size = 10, p = 0.2)
## [1] 1 1 3 4 1 2 0
In addition, the sample function provides a way to sample from a finite population.
# 10 coin flips
sample(c("H", "T"), 10, replace = TRUE)
## [1] "H" "H" "T" "H" "T" "T" "T" "H" "T" "T"
# a random permutation
sample(1:9, 9, replace = FALSE)
## [1] 7 1 4 2 6 3 5 8 9
# sample from a distribution with P(X=2) = 0.3; P(X=3)=0.5; and P(X=17)=0.2
sample(c(2, 3, 17), size = 15, replace = TRUE, prob = c(0.3, 0.5, 0.2))
## [1] 2 3 2 3 3 17 17 17 3 3 2 3 2 3 3
Inverse CDF method (inverse transform sampling)
Let X be our random variable of interest and assume X has CDF \(F_X(x)\).
GOAL: Generate random variables with the distribution of X.
ALGORITHM: 1. Generate \(U \sim \text{Unif}(0, 1)\). 2. Find x such that \(F_X(x) = U\), i.e., \(x = F_X^{-1}(U)\).
Theorem: Let X have a continuous CDF \(F_X\) and define \(Y = F_X(X)\). Then \(Y \sim \text{Unif}(0, 1)\).
Proof sketch: \(P(Y \le y) = P(F_X(X) \le y) = P(X \le F_X^{-1}(y)) = F_X(F_X^{-1}(y)) = y\).
Example (exponential with mean z): solve \(x_i = F_X^{-1}(u_i)\) with \(u_i\) uniform. From \(1 - e^{-x_i/z} = u_i\) we get \(e^{-x_i/z} = 1 - u_i\), so \(x_i = -z\ln(1 - u_i)\).
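The algebra above translates directly into R. A minimal sketch, assuming an exponential distribution with mean z (the function name rexp_inv is ours):

```r
# Inverse transform sampling for an exponential distribution with mean z:
# F(x) = 1 - exp(-x/z), so F^{-1}(u) = -z * log(1 - u).
rexp_inv <- function(n, z) {
  u <- runif(n)
  -z * log(1 - u)
}
set.seed(875)
x <- rexp_inv(10000, z = 2)
mean(x)   # should be close to 2
```

In practice rexp() already does this, but the sketch shows how the inverse CDF turns uniform draws into draws from another distribution.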
You can control random number generation with set.seed(). R uses the Mersenne Twister algorithm to generate random numbers based on the seed.
An example (the middle-square method): 1. Select a seed. 2. Multiply the seed by itself. 3. Output the middle digits. 4. Use the result as the new seed.
seed = 14
seed * seed # 196 -> 19
## [1] 196
19 * 19 # 361 -> 36
## [1] 361
36 * 36 # 1296 -> 29
## [1] 1296
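The steps above can be wrapped in a small function. This sketch is our own: it assumes the square is padded to four digits and the middle two digits are kept, which reproduces the hand computation.

```r
middle_square <- function(seed, iter) {
  out <- numeric(iter)
  for (i in 1:iter) {
    sq <- sprintf("%04d", seed^2)          # pad the square to four digits
    seed <- as.integer(substr(sq, 2, 3))   # keep the middle two digits
    out[i] <- seed
  }
  out
}
middle_square(14, 3)   # 19 36 29, matching the computations above
```

The sequence degenerates quickly (it soon gets stuck near zero), which is why the middle-square method is only a teaching example and not a practical PRNG.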
Classification methods are both extremely useful and an active area of research in statistics. In this chapter we will learn about two common, and somewhat different, classification methods: logistic regression and k nearest neighbors.
We would like to predict the value of Y based on the value of the predictor X. Let \(p(X) = P(Y = 1 \mid X)\). We will model p(X) with the logistic function, which takes values between 0 and 1, which is of course appropriate for modeling a probability:
library(ggplot2)
logistic <- function(x) {
exp(x)/(1 + exp(x))
}
ggplot(data.frame(x = c(-6, 6)), aes(x)) + stat_function(fun = logistic)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
head(Pima.tr)
## npreg glu bp skin bmi ped age type
## 1 5 86 68 28 30.2 0.364 24 No
## 2 7 195 70 33 25.1 0.163 55 Yes
## 3 5 77 82 41 35.8 0.156 35 No
## 4 0 165 76 43 47.9 0.259 26 No
## 5 0 107 60 25 26.4 0.133 23 No
## 6 5 97 76 27 35.6 0.378 52 Yes
It will be more convenient to code the presence or absence of diabetes by 1 and 0, so we first create another column in the data frame with this coding:
Pima.tr$diabetes <- rep(0, dim(Pima.tr)[1])
Pima.tr$diabetes[Pima.tr$type == "Yes"] <- 1
head(Pima.tr)
## npreg glu bp skin bmi ped age type diabetes
## 1 5 86 68 28 30.2 0.364 24 No 0
## 2 7 195 70 33 25.1 0.163 55 Yes 1
## 3 5 77 82 41 35.8 0.156 35 No 0
## 4 0 165 76 43 47.9 0.259 26 No 0
## 5 0 107 60 25 26.4 0.133 23 No 0
## 6 5 97 76 27 35.6 0.378 52 Yes 1
In R we will use the function glm to fit logistic regression models The glm function fits a wide variety of models.
To specify the logistic regression model we specify family = binomial as an argument to glm.
diabetes.lr1 <- glm(diabetes ~ glu, data = Pima.tr, family = binomial)
diabetes.lr1
##
## Call: glm(formula = diabetes ~ glu, family = binomial, data = Pima.tr)
##
## Coefficients:
## (Intercept) glu
## -5.50364 0.03778
##
## Degrees of Freedom: 199 Total (i.e. Null); 198 Residual
## Null Deviance: 256.4
## Residual Deviance: 207.4 AIC: 211.4
summary(diabetes.lr1)
##
## Call:
## glm(formula = diabetes ~ glu, family = binomial, data = Pima.tr)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9714 -0.7795 -0.5292 0.8491 2.2633
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.503636 0.836077 -6.583 4.62e-11 ***
## glu 0.037784 0.006278 6.019 1.76e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 256.41 on 199 degrees of freedom
## Residual deviance: 207.37 on 198 degrees of freedom
## AIC: 211.37
##
## Number of Fisher Scoring iterations: 4
beta0.lr.1 <- coef(diabetes.lr1)[1]
beta1.lr.1 <- coef(diabetes.lr1)[2]
beta0.lr.1
## (Intercept)
## -5.503636
beta1.lr.1
## glu
## 0.03778372
The coefficients are approximately b0 = -5.504 and b1 = 0.038. So for example we can estimate the probability that a woman in this population whose glucose level is 150 would be diabetic as:
exp(-5.504 + 0.038 * 150)/(1 + exp(-5.504 + 0.038 * 150))
## [1] 0.5488437
We can plot the fitted probabilities along with the data by hand.
diabetes.logistic.1 <- function(x){
exp(beta0.lr.1 + beta1.lr.1 * x)/(1 + exp(beta0.lr.1 + beta1.lr.1 * x))
}
ggplot(Pima.tr, aes(x = glu, y = diabetes)) +
stat_function(fun = diabetes.logistic.1)+
geom_point()
The ggplot2 package also provides a way to do this more directly, using stat_smooth.
ggplot(Pima.tr, aes(x = glu, y = diabetes)) +
geom_point() +
stat_smooth(method = "glm", method.args = list(family = "binomial"), se = F)
From these graphics we can see that although glucose level and diabetes are related, there are many women with high glucose levels who are not diabetic, and many with low glucose levels who are diabetic, so likely adding other predictors to the model will help.
head(Pima.te)
## npreg glu bp skin bmi ped age type
## 1 6 148 72 35 33.6 0.627 50 Yes
## 2 1 85 66 29 26.6 0.351 31 No
## 3 1 89 66 23 28.1 0.167 21 No
## 4 3 78 50 32 31.0 0.248 26 Yes
## 5 2 197 70 45 30.5 0.158 53 Yes
## 6 5 166 72 19 25.8 0.587 51 Yes
diabetes.probs.1 <- predict(diabetes.lr1, Pima.te, type = "response")
head(diabetes.probs.1)
## 1 2 3 4 5 6
## 0.52207428 0.09178605 0.10518608 0.07199064 0.87432542 0.68318799
The predict function (with type = "response" specified) provides \(p(x) = P(Y = 1 \mid X = x)\) for all the x values in a data frame.
diabetes.predict.1 <- rep("No", dim(Pima.te)[1])
diabetes.predict.1[diabetes.probs.1 > 0.5] <- "Yes"
head(diabetes.predict.1)
## [1] "Yes" "No" "No" "No" "Yes" "Yes"
length(diabetes.predict.1[diabetes.predict.1 == Pima.te$type])/dim(Pima.te)[1]
## [1] 0.7740964
We will next see how adding bmi, the body mass index, affects predictions of diabetes.
diabetes.lr2 <- glm(diabetes ~ glu + bmi, data = Pima.tr, family = binomial)
diabetes.lr2
##
## Call: glm(formula = diabetes ~ glu + bmi, family = binomial, data = Pima.tr)
##
## Coefficients:
## (Intercept) glu bmi
## -8.21611 0.03572 0.09002
##
## Degrees of Freedom: 199 Total (i.e. Null); 197 Residual
## Null Deviance: 256.4
## Residual Deviance: 198.5 AIC: 204.5
summary(diabetes.lr2)
##
## Call:
## glm(formula = diabetes ~ glu + bmi, family = binomial, data = Pima.tr)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0577 -0.7566 -0.4313 0.8011 2.2489
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.216106 1.346965 -6.100 1.06e-09 ***
## glu 0.035716 0.006311 5.659 1.52e-08 ***
## bmi 0.090016 0.031268 2.879 0.00399 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 256.41 on 199 degrees of freedom
## Residual deviance: 198.47 on 197 degrees of freedom
## AIC: 204.47
##
## Number of Fisher Scoring iterations: 4
diabetes.probs.2 <- predict(diabetes.lr2, Pima.te, type = "response")
head(diabetes.probs.2)
## 1 2 3 4 5 6
## 0.52358599 0.05809584 0.07530477 0.06662362 0.82713369 0.50879270
diabetes.predict.2 <- rep("No", dim(Pima.te)[1])
diabetes.predict.2[diabetes.probs.2 > 0.5] <- "Yes"
head(diabetes.predict.2)
## [1] "Yes" "No" "No" "No" "Yes" "Yes"
table(diabetes.predict.2, Pima.te$type)
##
## diabetes.predict.2 No Yes
## No 204 54
## Yes 19 55
length(diabetes.predict.2[diabetes.predict.2 == Pima.te$type])/dim(Pima.te)[1]
## [1] 0.7801205
Adding bmi as a predictor did not improve the predictions by very much!
lr.int <- -coef(diabetes.lr2)[1]/coef(diabetes.lr2)[3]
lr.slope <- -coef(diabetes.lr2)[2]/coef(diabetes.lr2)[3]
ggplot(Pima.tr, aes(x = glu, y = bmi)) + geom_point(aes(color = type)) +
geom_abline(intercept = lr.int, slope = lr.slope)
Logistic regression methods also are applicable to classification contexts where there are more than two classes. Consider for example Fisher’s iris data.
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species))
ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width)) +
geom_point(aes(color = Species))
Here the potential predictors are sepal width and length and petal width and length, and the goal is to find a classifier that will yield the correct species.
Before doing that we randomly choose 75 of the 150 rows of the data frame to be the training sample, with the other 75 being the test sample.
set.seed(321)
selected <- sample(1:150, replace = FALSE, size = 75)
iris.train <- iris[selected, ]
iris.test <- iris[-selected, ]
There are several packages which implement logistic regression for data with more than two classes. We will use the VGAM package. The function vglm within VGAM implements logistic regression (and much more).
library(VGAM)
## Loading required package: stats4
## Loading required package: splines
##
## Attaching package: 'VGAM'
## The following object is masked _by_ '.GlobalEnv':
##
## logistic
## The following object is masked from 'package:tidyr':
##
## fill
iris.lr <- vglm(Species ~ Petal.Width + Petal.Length, data = iris.train, family = multinomial)
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 6 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 12 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 14 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 16 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 16 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 22 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 25 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 28 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 31 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 36 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 37 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 44 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 45 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 50 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 54 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in vglm.fitter(x = x, y = y, w = w, offset = offset, Xm2 = Xm2, :
## convergence not obtained in 21 IRLS iterations
summary(iris.lr)
##
## Call:
## vglm(formula = Species ~ Petal.Width + Petal.Length, family = multinomial,
## data = iris.train)
##
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## log(mu[,1]/mu[,3]) -1.813e-06 -2.464e-07 4.360e-08 6.302e-08 1.644e-05
## log(mu[,2]/mu[,3]) -1.265e+00 -1.223e-04 6.589e-07 2.993e-03 2.471e+00
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 118.947 9260.552 NA NA
## (Intercept):2 63.510 34.004 NA NA
## Petal.Width:1 -27.338 14731.810 -0.002 0.999
## Petal.Width:2 -14.554 7.466 NA NA
## Petal.Length:1 -26.612 6453.201 -0.004 0.997
## Petal.Length:2 -8.242 5.591 NA NA
##
## Number of linear predictors: 2
##
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
##
## Residual deviance: 8.6819 on 144 degrees of freedom
##
## Log-likelihood: -4.341 on 144 degrees of freedom
##
## Number of iterations: 21
##
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):1', '(Intercept):2', 'Petal.Width:2', 'Petal.Length:2'
##
## Reference group is level 3 of the response
Notice that the family is specified as multinomial rather than binomial since we have more than two classes. When run with these data, the vglm function returns many (about 20) warnings, shown above; these occur mainly because the classes are so easily separated.
Next we compute the probabilities for the test data.
iris.probs <- predict(iris.lr, iris.test[, c(3, 4)], type = "response")
head(iris.probs)
## setosa versicolor virginica
## 4 1 1.003592e-11 1.129031e-32
## 5 1 1.598575e-12 7.887705e-34
## 9 1 1.598575e-12 7.887705e-34
## 11 1 1.003592e-11 1.129031e-32
## 12 1 6.300597e-11 1.616073e-31
## 14 1 1.799143e-15 1.747518e-38
For these, we want to choose the highest probability in each row.
which.max(c(2, 3, 1, 5, 8, 3))
## [1] 5
which.max(c(2, 20, 4, 5, 9, 1, 0))
## [1] 2
class.predictions <- apply(iris.probs, 1, which.max)
head(class.predictions)
## 4 5 9 11 12 14
## 1 1 1 1 1 1
class.predictions[class.predictions == 1] <- levels(iris$Species)[1]
class.predictions[class.predictions == 2] <- levels(iris$Species)[2]
class.predictions[class.predictions == 3] <- levels(iris$Species)[3]
head(class.predictions)
## 4 5 9 11 12 14
## "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"
Next we create the confusion matrix.
table(class.predictions, iris.test$Species)
##
## class.predictions setosa versicolor virginica
## setosa 26 0 0
## versicolor 0 19 1
## virginica 0 2 27
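A common summary of a confusion matrix is the overall accuracy: the sum of the diagonal over the total count. A quick sketch using the counts from the table above:

```r
# Counts copied from the confusion matrix above (rows = predicted, cols = true).
conf <- matrix(c(26,  0,  0,
                  0, 19,  1,
                  0,  2, 27), nrow = 3, byrow = TRUE)
sum(diag(conf)) / sum(conf)   # proportion classified correctly: 72/75 = 0.96
```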
Nearest neighbor methods provide a rather different way to construct classifiers, and have strengths (minimal assumptions, flexible decision boundaries) and weaknesses (computational burden, lack of interpretability) compared to logistic regression models.
There are at least three R packages which implement kNN, including class, kknn, and RWeka. We will use class below.
u.knn <- "http://blue.for.msu.edu/FOR875/data/knnExample.csv"
knnExample <- read.csv(u.knn, header=TRUE)
str(knnExample)
## 'data.frame': 200 obs. of 3 variables:
## $ x1 : num 2.5261 0.367 0.7682 0.6934 -0.0198 ...
## $ x2 : num 0.3211 0.0315 0.7175 0.7772 0.8673 ...
## $ class: int 0 0 0 0 0 0 0 0 0 0 ...
ggplot(data = knnExample, aes(x = x1, y = x2)) +
geom_point(aes(color = as.factor(class))) +
theme_bw()
expand.grid(x = c(1, 2), y = c(5, 3.4, 2))
## x y
## 1 1 5.0
## 2 2 5.0
## 3 1 3.4
## 4 2 3.4
## 5 1 2.0
## 6 2 2.0
min(knnExample$x1)
## [1] -2.52082
max(knnExample$x1)
## [1] 4.170746
min(knnExample$x2)
## [1] -1.999853
max(knnExample$x2)
## [1] 2.855805
x.test <- expand.grid(x1 = seq(-2.6, 4.2, by = 0.1), x2 = seq(-2, 2.9, by = 0.1))
library(class)
Example_knn <- knn(knnExample[, c(1, 2)], x.test, knnExample[,3], k = 15, prob = TRUE)
prob <- attr(Example_knn, "prob")
head(prob)
## [1] 0.6666667 0.6666667 0.6666667 0.7333333 0.7333333 0.7333333
train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
knn(train, test, cl, k = 3, prob=TRUE)
## [1] s s s s s s s s s s s s s s s s s s s s s s s s s c c v c c c c c v c
## [36] c c c c c c c c c c c c c c c v c c v v v v v c v v v v c v v v v v v
## [71] v v v v v
## attr(,"prob")
## [1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [8] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [15] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [22] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667
## [29] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667 1.0000000
## [36] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [43] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [50] 1.0000000 1.0000000 0.6666667 0.7500000 1.0000000 1.0000000 1.0000000
## [57] 1.0000000 1.0000000 0.5000000 1.0000000 1.0000000 1.0000000 1.0000000
## [64] 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [71] 1.0000000 0.6666667 1.0000000 1.0000000 0.6666667
## Levels: c s v
library(dplyr)
df1 <- mutate(x.test, prob = prob, class = 0, prob_cls = ifelse(Example_knn == class, 1, 0))
str(df1)
## 'data.frame': 3450 obs. of 5 variables:
## $ x1 : num -2.6 -2.5 -2.4 -2.3 -2.2 -2.1 -2 -1.9 -1.8 -1.7 ...
## $ x2 : num -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 ...
## $ prob : num 0.667 0.667 0.667 0.733 0.733 ...
## $ class : num 0 0 0 0 0 0 0 0 0 0 ...
## $ prob_cls: num 1 1 1 1 1 1 1 1 1 1 ...
df2 <- mutate(x.test, prob = prob, class = 1, prob_cls = ifelse(Example_knn == class, 1, 0))
bigdf <- bind_rows(df1, df2)
names(knnExample)
## [1] "x1" "x2" "class"
ggplot(bigdf) +
geom_point(aes(x = x1, y = x2, col = class),data = mutate(x.test, class = Example_knn), size = 0.5) +
geom_point(aes(x = x1, y = x2, col = as.factor(class)), size = 4, shape = 1, data = knnExample) +
geom_contour(aes(x = x1, y = x2, z = prob_cls, group = as.factor(class), color = as.factor(class)), size = 1, bins = 1, data = bigdf) +
theme_bw()
Example_knn <- knn(knnExample[, c(1, 2)], x.test, knnExample[,3], k = 1, prob = TRUE)
prob <- attr(Example_knn, "prob")
head(prob)
## [1] 1 1 1 1 1 1
df1 <- mutate(x.test, prob = prob, class = 0, prob_cls = ifelse(Example_knn == class, 1, 0))
str(df1)
## 'data.frame': 3450 obs. of 5 variables:
## $ x1 : num -2.6 -2.5 -2.4 -2.3 -2.2 -2.1 -2 -1.9 -1.8 -1.7 ...
## $ x2 : num -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 ...
## $ prob : num 1 1 1 1 1 1 1 1 1 1 ...
## $ class : num 0 0 0 0 0 0 0 0 0 0 ...
## $ prob_cls: num 1 1 1 1 1 1 1 1 1 1 ...
df2 <- mutate(x.test, prob = prob, class = 1, prob_cls = ifelse(Example_knn == class, 1, 0))
bigdf <- bind_rows(df1, df2)
ggplot(bigdf) + geom_point(aes(x = x1, y = x2, col = class), data = mutate(x.test, class = Example_knn), size = 0.5) +
geom_point(aes(x = x1, y = x2, col = as.factor(class)), size = 4, shape = 1, data = knnExample) +
geom_contour(aes(x = x1, y = x2, z = prob_cls, group = as.factor(class), color = as.factor(class)), size = 1, bins = 1, data = bigdf) +
theme_bw()
Next kNN is applied to the diabetes data. We will use the same predictors, glu and bmi that were used in the logistic regression example. Since the scales of the predictor variables are substantially different, they are standardized first. The value k = 15 is chosen for kNN.
Pima.tr[, 1:7] <- scale(Pima.tr[, 1:7], center = TRUE, scale = TRUE)
Pima.te[, 1:7] <- scale(Pima.te[, 1:7], center = TRUE, scale = TRUE)
knn_Pima <- knn(Pima.tr[, c(2, 5)], Pima.te[, c(2, 5)], Pima.tr[, 8], k = 15, prob = TRUE)
table(knn_Pima, Pima.te[, 8])
##
## knn_Pima No Yes
## No 206 55
## Yes 17 54
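The proportion of correct kNN predictions can be read off the table above as the sum of the diagonal over the total; a quick sketch with those counts:

```r
# Counts copied from the kNN confusion table above (rows = predicted, cols = true).
tab <- matrix(c(206, 55,
                 17, 54), nrow = 2, byrow = TRUE)
sum(diag(tab)) / sum(tab)   # about 0.783
```

This is comparable to the accuracy of about 0.78 obtained with logistic regression on the same two predictors.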
Now kNN is used to classify the iris data. As before we use petal length and width as predictors.
sd(iris.train$Petal.Width)
## [1] 0.728585
sd(iris.train$Petal.Length)
## [1] 1.671873
head(iris.train)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 144 6.8 3.2 5.9 2.3 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 36 5.0 3.2 1.2 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 58 4.9 2.4 3.3 1.0 versicolor
## 50 5.0 3.3 1.4 0.2 setosa
knn_iris <- knn(iris.train[, c(3, 4)], iris.test[, c(3, 4)], iris.train[, 5], k = 1, prob = TRUE)
table(knn_iris, iris.test[, 5])
##
## knn_iris setosa versicolor virginica
## setosa 26 0 0
## versicolor 0 20 1
## virginica 0 1 27
knn_iris <- knn(iris.train[, c(3, 4)], iris.test[, c(3, 4)], iris.train[, 5], k = 3, prob = TRUE)
table(knn_iris, iris.test[, 5])
##
## knn_iris setosa versicolor virginica
## setosa 26 0 0
## versicolor 0 19 1
## virginica 0 2 27
knn_iris <- knn(iris.train[, c(3, 4)], iris.test[, c(3, 4)], iris.train[, 5], k = 15, prob= TRUE)
table(knn_iris, iris.test[, 5])
##
## knn_iris setosa versicolor virginica
## setosa 26 0 0
## versicolor 0 19 1
## virginica 0 2 27
1. Enron email data set
2. Play by play sports data (S.Hauschka 37 yd Field Goal)
3. Unabomber Manifesto
Many applications require the ability to manipulate and process text data. For example, an email spam filter must process the text of incoming messages.
Two functions to read in text data:
The scan function reads data from a file and has many options: scan(file, what, sep), where what specifies the variable type and sep the delimiter. Useful sep values include " " for a space, "\n" for a new line, and "\t" for a tab.
The readLines function reads lines of text, for example from a URL.
u.email <- "http://blue.for.msu.edu/FOR875/data/email1.txt"
email1 <- scan(u.email, what = "character", sep = "")
length(email1)
## [1] 133
email1[1:10]
## [1] "From" "safety33o@l11.newnamedns.com"
## [3] "Fri" "Aug"
## [5] "23" "11:03:37"
## [7] "2002" "Return-Path:"
## [9] "<safety33o@l11.newnamedns.com>" "Delivered-To:"
email1 <- scan(u.email, what = "character", sep = "\n")
length(email1)
## [1] 26
email1[1:10]
## [1] "From safety33o@l11.newnamedns.com Fri Aug 23 11:03:37 2002"
## [2] "Return-Path: <safety33o@l11.newnamedns.com>"
## [3] "Delivered-To: zzzz@localhost.example.com"
## [4] "Received: from localhost (localhost [127.0.0.1])"
## [5] "\tby phobos.labs.example.com (Postfix) with ESMTP id 5AC994415F"
## [6] "\tfor <zzzz@localhost>; Fri, 23 Aug 2002 06:02:59 -0400 (EDT)"
## [7] "Received: from mail.webnote.net [193.120.211.219]"
## [8] "\tby localhost with POP3 (fetchmail-5.9.0)"
## [9] "\tfor zzzz@localhost (single-drop); Fri, 23 Aug 2002 11:02:59 +0100 (IST)"
## [10] "Received: from l11.newnamedns.com ([64.25.38.81])"
The scan function is quite flexible. In fact, read.table uses scan to actually read in the data. Read the help file for scan if more information is desired.
u.moby <- "http://blue.for.msu.edu/FOR875/data/mobydick.txt"
moby_dick <- scan(u.moby, what = "character", sep = "\n")
moby_dick[1:25]
## [1] "The Project Gutenberg EBook of Moby Dick; or The Whale, by Herman Melville"
## [2] "This eBook is for the use of anyone anywhere at no cost and with"
## [3] "almost no restrictions whatsoever. You may copy it, give it away or"
## [4] "re-use it under the terms of the Project Gutenberg License included"
## [5] "with this eBook or online at www.gutenberg.org"
## [6] "Title: Moby Dick; or The Whale"
## [7] "Author: Herman Melville"
## [8] "Last Updated: January 3, 2009"
## [9] "Posting Date: December 25, 2008 [EBook #2701]"
## [10] "Release Date: June, 2001"
## [11] "Language: English"
## [12] "*** START OF THIS PROJECT GUTENBERG EBOOK MOBY DICK; OR THE WHALE ***"
## [13] "Produced by Daniel Lazarus and Jonesey"
## [14] "MOBY DICK; OR THE WHALE"
## [15] "By Herman Melville"
## [16] "Original Transcriber's Notes:"
## [17] "This text is a combination of etexts, one from the now-defunct ERIS"
## [18] "project at Virginia Tech and one from Project Gutenberg's archives. The"
## [19] "proofreaders of this version are indebted to The University of Adelaide"
## [20] "Library for preserving the Virginia Tech version. The resulting etext"
## [21] "was compared with a public domain hard copy version of the text."
## [22] "In chapters 24, 89, and 90, we substituted a capital L for the symbol"
## [23] "for the British pound, a unit of currency."
## [24] "ETYMOLOGY."
## [25] "(Supplied by a Late Consumptive Usher to a Grammar School)"
You will notice that the scan function ignored the blank lines in the file. If it is important to preserve blank lines, supply the argument blank.lines.skip = FALSE to scan.
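A minimal sketch of this behavior, again using scan's text argument in place of a file: by default the blank line is dropped, but with blank.lines.skip = FALSE it comes back as an empty string.

```r
# default: the blank line between "a" and "b" is skipped
scan(text = "a\n\nb", what = "character", sep = "\n", quiet = TRUE)
## [1] "a" "b"

# blank.lines.skip = FALSE keeps it as an empty string
scan(text = "a\n\nb", what = "character", sep = "\n",
     blank.lines.skip = FALSE, quiet = TRUE)
## [1] "a" ""  "b"
```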
The file also includes some introductory and closing text that is not part of the novel itself, which we remove by subsetting:
moby_dick <- moby_dick[408:18576]
length(moby_dick)
## [1] 18169
The paste function concatenates vectors after (if necessary) converting them to character.
paste("Homer Simpson", "is", "Bart Simpson's", "father")
## [1] "Homer Simpson is Bart Simpson's father"
n <- 10
paste("The value of n is", n)
## [1] "The value of n is 10"
paste(c("pig", "dog"), 3)
## [1] "pig 3" "dog 3"
By default the paste function separates the input vectors with a space, but other separators can be specified via the sep argument.
paste("mail", "google", "com", sep = ".")
## [1] "mail.google.com"
paste(c("dog", "cat", "horse", "human", "elephant"), "food")
## [1] "dog food" "cat food" "horse food" "human food"
## [5] "elephant food"
paste(c("one", "two", "three", "four", "five"), c("six", "seven", "eight", "nine", "ten"))
## [1] "one six" "two seven" "three eight" "four nine" "five ten"
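The examples above rely on R's recycling rule: when the input vectors have different lengths, paste recycles the shorter vector to match the longer one, as this small sketch shows.

```r
# the two-element vector is recycled against the four-element one
paste(c("x", "y"), 1:4)
## [1] "x 1" "y 2" "x 3" "y 4"
```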
To collapse the result into a single character element, use the collapse argument:
paste(c("one", "two", "three", "four", "five"), c("six", "seven", "eight", "nine", "ten"), collapse = ".")
## [1] "one six.two seven.three eight.four nine.five ten"
paste(c("one", "two", "three", "four", "five"), c("six", "seven", "eight", "nine", "ten"), collapse = " ")
## [1] "one six two seven three eight four nine five ten"
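The sep and collapse arguments can be combined: sep joins the elementwise pairs first, and collapse then joins the resulting pieces into one string.

```r
# sep = "-" joins within each pair; collapse = ", " then joins the pairs
paste(c("one", "two"), c("A", "B"), sep = "-", collapse = ", ")
## [1] "one-A, two-B"
```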
paste(c("a", "b"), 1:10, sep = "")
## [1] "a1" "b2" "a3" "b4" "a5" "b6" "a7" "b8" "a9" "b10"
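Because sep = "" is so common, base R also provides paste0, a shorthand that concatenates with no separator.

```r
# paste0(x, y) is equivalent to paste(x, y, sep = "")
paste0(c("a", "b"), 1:4)
## [1] "a1" "b2" "a3" "b4"
```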